This paper is concerned with the problems of interaction screening
and nonlinear classification in a high-dimensional setting. We
propose a two-step procedure, IIS-SQDA, where in the first step
an innovated interaction screening (IIS) approach based on transforming
the original p-dimensional feature vector is proposed, and
in the second step a sparse quadratic discriminant analysis (SQDA)
is proposed for further selecting important interactions and main effects
and simultaneously conducting classification. Our IIS approach
screens important interactions by examining only p features instead
of all two-way interactions of order $O(p^2)$. Our theory shows that the
proposed method enjoys the sure screening property in interaction selection
in the high-dimensional setting of p growing exponentially with
the sample size. In the selection and classification step, we establish
a sparse inequality on the estimated coefficient vector for QDA and
prove that the classification error of our procedure can be upper-bounded
by the oracle classification error plus some smaller order
term. Extensive simulation studies and real data analysis show that
our proposal compares favorably with existing methods in interaction
selection and high-dimensional classification.

1. Introduction. Classification, aiming at identifying to which of a set
of categories a new observation belongs, has been frequently encountered in
various fields such as genomics, proteomics, face recognition, brain images,
medicine and machine learning. In recent years, there has been a significant
surge of interest in interaction selection in classification due to the
importance of interactions in statistical inference and contemporary scientific
discoveries. For instance, in genome-wide association studies, it has been
increasingly recognized that gene-gene interactions and gene-environment
interactions substantially influence the risk of developing a human disease
[18]. Ignoring these interactions could potentially lead to misunderstanding
of disease mechanisms, as they are potential sources of the missing
heritability [21].

Identification of interactions is challenging even when the number of predictors
p is moderately large compared to the sample size n, as the number
of all possible pairwise interaction effects is of order $O(p^2)$. This problem
becomes even more challenging in the high-dimensional setting where p can
be much larger than n. It is well known that classical low-dimensional
classification methods cannot be directly used for high-dimensional classification
for at least three reasons. First, many popular classifiers, such as linear
discriminant analysis (LDA) and quadratic discriminant analysis (QDA),
are inapplicable when p exceeds n because of the singularities of the sample
covariance matrices. Second, when p is large, it is commonly believed
that only a subset of the p features contribute to classification. Classification
using all potential features may cause difficulty in interpretation and
degrade the classification performance due to the noise accumulation in
estimating a large number of parameters [7]. Third, the computational cost
may be extremely high when the dimensionality is ultra-high. For example,
with p = 1000 features, the dimensionality is about half a million if all possible
pairwise interactions are included in classification.

In recent years, significant efforts have been made to develop effective
high-dimensional classification methods. The most commonly imposed assumption
is sparsity, leading to sparse classifiers. Tibshirani et al. [25]
introduced the nearest shrunken centroids classifier, and Fan and Fan [7]
proposed features annealed independent rules, both of which ignore correlations
among features to reduce the dimensionality of parameters. Shao et al. [23]
proposed and studied a sparse LDA method, which directly plugs the sparse
estimates of the covariance matrix and mean vector into the linear classifier.
Cai and Liu [4] introduced a direct approach to sparse LDA by estimating
the product of the precision matrix and the mean difference vector of two
classes, through constrained $L_1$ minimization. In an independent work, Mai
et al. [20] also proposed a direct approach to sparse LDA, called DSDA, by
reformulating the LDA problem as a penalized least squares regression. Fan
et al. [9] considered the HCT classifier for high-dimensional Gaussian classification
with sparse precision matrix when the signals are rare and weak, and
studied its optimality. A commonality of these aforementioned methods is
that the underlying true classifier is assumed to be linear, and thus they
belong to the class of sparse LDA methods.

A key assumption for LDA is that observations from different classes share
the same correlation structure. Although this assumption can significantly
reduce the number of parameters that need to be estimated, it can be easily
violated in real applications. In addition, linear classifiers are not capable
of identifying important interaction effects between features and thus can
lead to inferior feature selection and classification results, and consequently,
misleading interpretations when the classification boundary is nonlinear. For
instance, in a two-class Gaussian classification problem, when two classes
have equal mean vectors but different covariance matrices, linear classifiers
can perform no better than random guessing.

Table 1

The means and standard errors (in parentheses) of various performance measures for
different classification methods over 100 replications where the Bayes rule is given in (1).
Sample size in each class is 100, and the number of features p is 200

Measure     PLR            DSDA           IIS-SQDA       Oracle
MR (%)      49.95 (0.05)   49.87 (0.09)   26.03 (0.31)   23.99 (0.08)
FP.main     49.66 (4.93)   74.29 (6.45)   3.16 (0.82)    0 (0)
FP.inter    -              -              0.55 (0.14)    0 (0)
FN.inter    -              -              0.15 (0.05)    0 (0)

To gain some insight into the importance of interactions in classification,
let us look at a simple example. Consider a two-class Gaussian classification
problem with the Bayes rule

(1)  $Q(\mathbf{z}) = -0.3Z_{10}^2 - 0.15Z_{10}Z_{30} - 0.15Z_{10}Z_{50} - 0.3Z_{30}^2 - 0.15Z_{30}Z_{50} - 0.3Z_{50}^2 + 1.74913,$

which classifies a new observation z to class 1 if and only if $Q(\mathbf{z}) > 0$. Thus
there are no main effects, and there are three variables, $Z_{10}$, $Z_{30}$ and $Z_{50}$,
contributing to interactions. We simulated data in the same way as model
2 in Section 5.2.2, except that the mean vector in each class is zero. See
Section 5.2.2 for more details. Table 1 lists the performance of different
classification methods, including penalized logistic regression (PLR), DSDA, our
proposal (IIS-SQDA) and the oracle procedure (Oracle). The oracle procedure
uses the information of the true underlying sparse model and thus is a
low-dimensional QDA. As expected, both linear classifiers, PLR and DSDA,
perform no better than random guessing. Table 1 also shows the variable
selection results for main effects and interactions, with FP.main standing
for false positives of main effects, and FP.inter and FN.inter standing for
false positives and false negatives of interaction effects, respectively. It is
seen that with appropriate selection of interaction effects, the classification
performance can be improved significantly.
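As a rough sanity check on the toy Bayes rule above, the following sketch rebuilds the quadratic form from its coefficients and recomputes the constant term. It assumes, beyond what this section states, that the class 1 covariance matrix is the identity, that the two classes have equal prior probability, and that the constant is the usual Gaussian log-determinant term; the variable names are illustrative only.

```python
import numpy as np

# Assumptions (not stated in this section): Sigma_1 = I, prior pi = 1/2,
# zero means in both classes, and Omega = Omega_2 - Omega_1 read off from (1).
idx = [10, 30, 50]                 # 1-based labels of Z_10, Z_30, Z_50
p = 60                             # any p >= 50 is enough for this check

Omega = np.zeros((p, p))
for a in idx:
    Omega[a - 1, a - 1] = -0.6     # (1/2) * Omega_jj = -0.3 is the coefficient of Z_j^2
    for b in idx:
        if a != b:
            Omega[a - 1, b - 1] = -0.15   # Omega_ab is the coefficient of Z_a * Z_b

Omega_1 = np.eye(p)
Omega_2 = Omega_1 + Omega

# With zero means and pi = 1/2 the constant of the Gaussian log-likelihood ratio is
# (1/2) * log(|Sigma_2| / |Sigma_1|) = -(1/2) * (log|Omega_2| - log|Omega_1|).
sign_2, logdet_2 = np.linalg.slogdet(Omega_2)
sign_1, logdet_1 = np.linalg.slogdet(Omega_1)
zeta = -0.5 * (logdet_2 - logdet_1)


def Q(z):
    """Quadratic rule implied by the assumed Omega and zeta (no main effects)."""
    return 0.5 * z @ Omega @ z + zeta


print(round(zeta, 5))
```

Under these assumptions the recomputed constant agrees with the value 1.74913 in (1) to the printed precision, which also supports reading all three squared terms with coefficient -0.3.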

In this paper we consider two-class classification with possibly unequal
covariance matrices. Under some sparsity assumption on the main effects
and interactions, we propose a two-stage classification procedure, where we
first reduce the number of interactions to a moderate order by a new interaction
screening approach, and then identify both important main effects
and interactions using some variable selection techniques. Our interaction
screening approach is motivated by a result, which will be formally demonstrated
in our paper, that if an interaction term, say $Z_1Z_2$, appears in the Bayes
decision rule, then after appropriately transforming the original features,
the resulting new feature $\tilde{Z}_1$ (and $\tilde{Z}_2$) has different variances across classes.
Thus the original problem of screening $O(p^2)$ pairwise interaction effects
can be recast as the problem of comparing variances of only p variables,
which can be solved by some variance test procedures such as the F-test or
the SIRI method proposed in [16]. A similar idea of interaction screening
has also been considered in [16] under the model setting of sliced inverse
index model. Hereafter, we refer to $Z_i$ as an interaction variable if an interaction
term involving $Z_i$ appears in the Bayes rule. After obtaining interaction
variables in the first step, we reconstruct interaction terms based on these
screened interaction variables, and then use recent advances in the variable
selection literature to further select important ones from the pool of all main
effects and reconstructed interactions. Under some mild conditions, we prove
that with overwhelming probability, all active interaction variables will be
retained using our screening procedure. For the second step of selection and
classification, we first establish a sparse inequality [27], which shows the
consistency of the estimated coefficient vector of QDA, then further prove that
the classification error of IIS-SQDA is upper-bounded by the oracle classification
error plus a smaller order term. Our numerical studies demonstrate
the fine performance of the proposed method for interaction screening and
high-dimensional classification.

The main contributions of this paper are as follows. First, we introduce an
interaction screening approach, which is proved to enjoy the sure screening
property. Second, our classification method does not rely on the linearity
assumption, which makes our method more applicable in real applications.
Third, our proposed classification procedure is adaptive in the sense that it
automatically chooses between sparse LDA and sparse QDA. If the index set
of screened interaction variables is empty in the first step, or if the index set
in the first step is nonempty but none of the interaction terms is selected in
the second step, then sparse LDA will be used for classification; otherwise,
sparse QDA will be used for classification. Fourth, we provide theoretical
justifications on the effectiveness of the proposed procedure.

The remaining part of the paper will unfold as follows. Section 2 introduces
the model setting and motivation. Section 3 proposes the innovated
interaction screening approach and studies its theoretical property. Section 4
considers post-screening variable selection. Section 5 presents the results of
extensive simulation studies and a real data example. Section 6 concludes
with some discussion. Section 7 collects all proofs for the main theorems.
Additional proofs are provided in the supplementary material [10].

2. Model setting and motivation. Our interaction screening approach is
motivated by the problem of two-class Gaussian classification, where the
p-dimensional feature vector $\mathbf{z} = (Z_1, \ldots, Z_p)^T$ follows a mixture distribution

(2)  $\mathbf{z} = \Delta\mathbf{z}^{(1)} + (1 - \Delta)\mathbf{z}^{(2)}$

with $\mathbf{z}^{(k)}$ a Gaussian random vector with mean $\mu_k$ and covariance matrix
$\Sigma_k$, k = 1, 2, and the class label $\Delta$ following a Bernoulli distribution with
probability of success $\pi$. Without loss of generality, assume that $\mu_2 = 0$.
Under this model setting, the Bayes rule admits the following form:

(3)  $Q(\mathbf{z}) = \frac{1}{2}\mathbf{z}^T\Omega\mathbf{z} + \delta^T\mathbf{z} + \zeta,$

where $\Omega = \Sigma_2^{-1} - \Sigma_1^{-1}$, $\delta = \Sigma_1^{-1}\mu_1$, and $\zeta$ is some constant depending only
on $\pi$, $\mu_1$ and $\Sigma_k$, k = 1, 2. A new observation z is classified into class 1 if
and only if $Q(\mathbf{z}) > 0$.
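To make the quadratic Bayes rule concrete, here is a minimal numerical sketch that builds the quadratic, linear and constant terms from given Gaussian parameters and compares the resulting score with the log posterior odds computed directly from the two densities. The explicit expression used for the constant is our own derivation from the Gaussian log-likelihood ratio with the second class mean set to zero, and the function names are hypothetical.

```python
import numpy as np


def qda_bayes_rule(mu1, Sigma1, Sigma2, pi):
    """Return (Omega, delta, zeta) of Q(z) = 0.5 z'Omega z + delta'z + zeta, with mu2 = 0."""
    Omega1 = np.linalg.inv(Sigma1)
    Omega2 = np.linalg.inv(Sigma2)
    Omega = Omega2 - Omega1
    delta = Omega1 @ mu1
    _, logdet1 = np.linalg.slogdet(Sigma1)
    _, logdet2 = np.linalg.slogdet(Sigma2)
    # Constant term derived from the log-likelihood ratio (an assumption of this sketch).
    zeta = (np.log(pi / (1 - pi)) + 0.5 * (logdet2 - logdet1)
            - 0.5 * mu1 @ Omega1 @ mu1)
    return Omega, delta, zeta


def gaussian_logpdf(z, mu, Sigma):
    """Log density of N(mu, Sigma) evaluated at z."""
    diff = z - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet + quad)


rng = np.random.default_rng(0)
p, pi = 4, 0.3
A1 = rng.normal(size=(p, p))
A2 = rng.normal(size=(p, p))
Sigma1 = A1 @ A1.T + p * np.eye(p)      # arbitrary positive definite covariances
Sigma2 = A2 @ A2.T + p * np.eye(p)
mu1 = rng.normal(size=p)

Omega, delta, zeta = qda_bayes_rule(mu1, Sigma1, Sigma2, pi)
z = rng.normal(size=p)
Q_value = 0.5 * z @ Omega @ z + delta @ z + zeta
log_odds = (np.log(pi) + gaussian_logpdf(z, mu1, Sigma1)
            - np.log(1 - pi) - gaussian_logpdf(z, np.zeros(p), Sigma2))
```

The two quantities `Q_value` and `log_odds` should agree up to rounding, so classifying to class 1 when the score is positive is the same as choosing the class with the larger posterior probability.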

When the covariance matrices $\Sigma_1$ and $\Sigma_2$ are the same, the above Bayes rule
takes the linear form $Q(\mathbf{z}) = \delta^T\mathbf{z} + \zeta$, which is frequently referred to as
Fisher's LDA and belongs to the family of linear classifiers. As discussed
in the Introduction, linear classifiers may be inefficient or even fail when
the true classification boundary is nonlinear. Moreover, linear classifiers are
incapable of selecting important interaction terms when the covariance matrices
are different across two classes. For ease of presentation, hereafter
we mean interaction in the broad sense of the term, not just the two-way
interactions $Z_jZ_\ell$ with $j \neq \ell$, but also the quadratic terms $Z_j^2$. So there are
$p(p + 1)/2$ possible interactions in total under our definition. Throughout
this paper we call $Z_jZ_\ell$, $1 \leq j, \ell \leq p$, an active interaction if its coefficient is
nonzero in (3), and we call $Z_j$ an interaction variable if there exists some
$\ell \in \{1, 2, \ldots, p\}$ such that $Z_jZ_\ell$ is an active interaction. Selecting important
ones from the large number of interactions is interesting yet challenging. We
next discuss our proposal for interaction screening.

From (3), one can observe that an interaction term $Z_jZ_\ell$ is an active
interaction if and only if $\Omega_{j\ell} \neq 0$. Here we use $A_{j\ell}$ to denote the $(j, \ell)$ element
of any matrix A. This observation motivates us to select active interactions
by recovering the support of $\Omega$. Denote the index set of interaction variables
by

(4)  $\mathcal{I} = \{1 \leq j \leq p : Z_jZ_\ell \text{ is an active interaction for some } 1 \leq \ell \leq p\}.$

In light of (3), the above set can also be written as
$\mathcal{I} = \{1 \leq j \leq p : \Omega_{j\ell} \neq 0 \text{ for some } 1 \leq \ell \leq p\}$. If the index set $\mathcal{I}$ can be recovered, then all active
interactions can be reconstructed. For this reason, we aim at developing an
effective method for screening the index set $\mathcal{I}$.

In a high-dimensional setting, to ensure the model identifiability and to
enhance the model fitting accuracy and interpretability, it is commonly assumed
that only a small number of interactions contribute to classification.
Thus we impose the sparsity assumption that there are only a small number
of active interactions. Equivalently, we can assume that $\Omega$ is highly sparse
with only $q = o(\min\{n, p\})$ rows (and columns, by symmetry) being nonzero,
where n is the total sample size. Denote $\Sigma_k^{-1}$ by $\Omega_k$, k = 1, 2. Without loss
of generality, we write $\Omega$ as

(5)  $\Omega = \Omega_2 - \Omega_1 = \begin{pmatrix} \mathbf{B} & \mathbf{0} \\ \mathbf{0}^T & \mathbf{0} \end{pmatrix},$

where B is a $q \times q$ symmetric matrix with at least one nonzero element in
each row. We remark that the block structure in (5) is just for the simplicity
of presentation, and we do not require that the locations of nonzero rows of
$\Omega$ are known. In fact, we will develop a method to estimate the indices of
these nonzero rows. Note that the set $\mathcal{I}$ can be further written as

$\mathcal{I} = \{1 \leq j \leq q : B_{j\ell} \neq 0 \text{ for some } 1 \leq \ell \leq q\}.$

Thus interaction screening is equivalent to finding the indices of features
related to B.

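Identifying the index set $\mathcal{I}$ is challenging when p is large. We overcome
this difficulty by decomposing $\mathcal{I}$ into two subsets. Let

$\mathcal{I}_1 = \{j \in \mathcal{I} : B_{jj} \leq 0\}, \qquad \mathcal{I}_2 = \{j \in \mathcal{I} : B_{jj} > 0\}.$

Then $\mathcal{I} = \mathcal{I}_1 \cup \mathcal{I}_2$. This allows us to estimate $\mathcal{I}$ by dealing with $\mathcal{I}_1$ and $\mathcal{I}_2$
separately.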

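First consider $\mathcal{I}_1$. Our main idea is to use the transformation $\tilde{\mathbf{z}} = \Omega_1\mathbf{z}$.
Denote by $\tilde{\mathbf{z}}^{(k)} = \Omega_1\mathbf{z}^{(k)}$ the transformed feature vector from class k with
k = 1, 2. Then $\mathrm{cov}(\tilde{\mathbf{z}}^{(1)}) = \Omega_1$ and $\mathrm{cov}(\tilde{\mathbf{z}}^{(2)}) = \Omega_1\Sigma_2\Omega_1$. It follows from linear
algebra that the difference of the above two covariance matrices takes the
following form:

(6)  $\tilde{\Sigma}_1 = \Omega_1\Sigma_2\Omega_1 - \Omega_1 = \Omega\Sigma_2\Omega - \Omega = \begin{pmatrix} \mathbf{B}\Sigma_2^{(1)}\mathbf{B} - \mathbf{B} & \mathbf{0} \\ \mathbf{0}^T & \mathbf{0} \end{pmatrix},$

where $\Sigma_2^{(1)}$ is the $q \times q$ principal submatrix of $\Sigma_2$ corresponding to matrix
B. We will show that if $j \in \mathcal{I}_1$, then the jth entry in the transformed feature
vector $\tilde{\mathbf{z}}$ has different variances across two classes. To this end, let $\mathbf{e}_j$ be a
unit vector with jth component 1 and all other components 0. Then it follows
from the positive definiteness of $\Sigma_2^{(1)}$ that $(\mathbf{B}\mathbf{e}_j)^T\Sigma_2^{(1)}(\mathbf{B}\mathbf{e}_j)$ is positive for
any $j \in \mathcal{I}_1$. Since $B_{jj} \leq 0$ for any $j \in \mathcal{I}_1$, the jth diagonal element of $\tilde{\Sigma}_1$ is
positive by noting that

(7)  $(\tilde{\Sigma}_1)_{jj} = (\mathbf{B}\mathbf{e}_j)^T\Sigma_2^{(1)}(\mathbf{B}\mathbf{e}_j) - B_{jj}.$

This gives a set inclusion

(8)  $\mathcal{I}_1 \subset \mathcal{A}_1 = \{j : (\tilde{\Sigma}_1)_{jj} \neq 0\}.$

Observing that $(\tilde{\Sigma}_1)_{jj}$ is the difference of between-class variances of the jth
transformed variable, that is,

(9)  $(\tilde{\Sigma}_1)_{jj} = \mathrm{var}(\mathbf{e}_j^T\tilde{\mathbf{z}}^{(2)}) - \mathrm{var}(\mathbf{e}_j^T\tilde{\mathbf{z}}^{(1)}),$

the index set $\mathcal{A}_1$ can be obtained by examining which features have different
variances across two classes after the transformation.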

We further remark that the variance difference between $\mathbf{e}_j^T\tilde{\mathbf{z}}^{(1)}$ and $\mathbf{e}_j^T\tilde{\mathbf{z}}^{(2)}$
records the accumulated contributions of the jth feature to the interaction.
To understand this, note that if $\Sigma_2^{(1)}$ has the smallest eigenvalue bounded
from below by a positive constant $\tau_1$, then (7) and (9) together ensure that

$\mathrm{var}(\mathbf{e}_j^T\tilde{\mathbf{z}}^{(2)}) - \mathrm{var}(\mathbf{e}_j^T\tilde{\mathbf{z}}^{(1)}) \geq \tau_1\|\mathbf{B}\mathbf{e}_j\|_2^2 - B_{jj},$

where $\|\cdot\|_2$ denotes the $L_2$ norm of a vector. In view of (3) and (5), the
jth column (and row, by symmetry) of B records all contributions of the
jth feature to interactions. Thus the more important the jth feature is to
interaction, the larger the variance difference.

Similarly, consider the transformation $\check{\mathbf{z}} = \Omega_2\mathbf{z}$, and define the matrix
$\tilde{\Sigma}_2 = \Omega_2 - \Omega_2\Sigma_1\Omega_2$. Then $\tilde{\Sigma}_2$ is the difference between the covariance matrices
of transformed feature vectors $\check{\mathbf{z}}^{(1)} = \Omega_2\mathbf{z}^{(1)}$ and $\check{\mathbf{z}}^{(2)} = \Omega_2\mathbf{z}^{(2)}$. Using
arguments similar to those in (8), we get another set inclusion

(10)  $\mathcal{I}_2 \subset \mathcal{A}_2 = \{j : (\tilde{\Sigma}_2)_{jj} \neq 0\}.$

Similarly, the set $\mathcal{A}_2$ can be obtained by examining which features have
different variances across two classes after the transformation based on $\Omega_2$.
Combining (8) and (10) leads to

(11)  $\mathcal{I} \subset \mathcal{A}_1 \cup \mathcal{A}_2.$

Meanwhile, by (6) we have $\mathcal{A}_1 \subset \mathcal{I}$. Similarly, we obtain $\mathcal{A}_2 \subset \mathcal{I}$. Combining
these results with the set inclusion (11) ensures that

(12)  $\mathcal{I} = \mathcal{A}_1 \cup \mathcal{A}_2.$

This motivates us to find interaction variables by testing variances of the
transformed feature vectors $\tilde{\mathbf{z}}$ and $\check{\mathbf{z}}$ across two classes. Since the transformation
based on the precision matrix $\Omega_k$ is called innovation in the time series
literature, we name our method the innovated interaction screening (IIS).
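The population-level argument of this section can be checked numerically on a small example. The sketch below uses a hypothetical 3 x 3 block B with negative, zero and positive diagonal entries, verifies the matrix identity in (6), and confirms that the indices with nonzero diagonal entries in the two variance-difference matrices together recover the interaction variables as in (12). All matrices and the tolerance are illustrative choices, not values from the paper.

```python
import numpy as np

p, q = 8, 3
rng = np.random.default_rng(1)

# Hypothetical symmetric block B: every row has a nonzero entry, diagonal signs are mixed.
B = np.array([[-0.3, 0.1, 0.0],
              [0.1, 0.0, 0.1],
              [0.0, 0.1, 0.2]])
Omega = np.zeros((p, p))
Omega[:q, :q] = B

# Omega_1 has eigenvalues >= 2, so Omega_2 = Omega_1 + Omega stays positive definite.
M = rng.normal(size=(p, p))
Omega_1 = M @ M.T / p + 2.0 * np.eye(p)
Omega_2 = Omega_1 + Omega
Sigma_1 = np.linalg.inv(Omega_1)
Sigma_2 = np.linalg.inv(Omega_2)

# Covariance differences after the two innovated transforms.
Sigma_tilde_1 = Omega_1 @ Sigma_2 @ Omega_1 - Omega_1
Sigma_tilde_2 = Omega_2 - Omega_2 @ Sigma_1 @ Omega_2

# Right-hand side of identity (6).
rhs_6 = Omega @ Sigma_2 @ Omega - Omega

tol = 1e-8   # numerical tolerance for "nonzero"
A_1 = set(np.flatnonzero(np.abs(np.diag(Sigma_tilde_1)) > tol).tolist())
A_2 = set(np.flatnonzero(np.abs(np.diag(Sigma_tilde_2)) > tol).tolist())
I_set = set(np.flatnonzero(np.abs(Omega).sum(axis=1) > 0).tolist())

print(A_1 | A_2, I_set)
```

In this example only the first three coordinates carry interactions, and the union of the two diagonal supports picks out exactly those coordinates, while all remaining diagonal entries vanish up to rounding error.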

The innovated transform has also been explored in other papers. For example,
Hall and Jin [14] proposed the innovated higher criticism based on
the innovated transform on the original feature vector to detect sparse and
weak signals when the noise variables are correlated, and established an
upper bound to the detection boundary. In the two-class Gaussian linear
classification setting, Fan et al. [9] discussed in detail the advantage of the innovated
transform. They showed that the innovated transform is best at boosting
the signal-to-noise ratio in their model setting. Detailed discussions about
the innovated transform in the multiple testing context can be found in [17].

3. Sampling property of innovated interaction screening.


3.1. Technical assumptions. We study the sampling properties of the IIS
procedure in this section. In our theoretical development, the Gaussian
distribution assumption in Section 2 will be relaxed to sub-Gaussian, but
implicitly, we still assume that our target classifier takes the form (3). A
random vector $\mathbf{w} = (W_1, \ldots, W_p)^T \in \mathbb{R}^p$ is sub-Gaussian if there exist some
positive constants a and b such that $P(|\mathbf{v}^T\mathbf{w}| > t) \leq a\exp(-bt^2)$ for any
t > 0 and any vector $\mathbf{v} \in \mathbb{R}^p$ satisfying $\|\mathbf{v}\|_2 = 1$. The following conditions
will be needed for our theoretical development:

Condition 1 (Sub-Gaussian). Both $\mathbf{z}^{(1)}$ and $\mathbf{z}^{(2)}$ are sub-Gaussian.

Condition 2 (Bounds of eigenvalues). There exist some positive constant
$\tau_1$ and some positive sequence $\tau_{2,p}$ depending only on p such that the
eigenvalues of $\Sigma_1$ and $\Sigma_2$ satisfy

$\tau_1 \leq \lambda_{\min}(\Sigma_k) \leq \lambda_{\max}(\Sigma_k) \leq \tau_{2,p}$ for k = 1, 2,

where $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ denote the smallest and largest eigenvalues of a
matrix, respectively.

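Condition 3 (Distinguishability). Denote by $\sigma_j^2$, $(\sigma_j^{(1)})^2$ and $(\sigma_j^{(2)})^2$ the
population variances of the jth covariates in $\tilde{\mathbf{z}}$, $\tilde{\mathbf{z}}^{(1)}$ and $\tilde{\mathbf{z}}^{(2)}$, respectively.
There exist some positive constants $\kappa$ and c such that for any $j \in \mathcal{A}_1$ with
$\mathcal{A}_1$ defined in (8), it holds that

(13)  $\frac{\sigma_j^2}{[(\sigma_j^{(1)})^2]^{\pi}\,[(\sigma_j^{(2)})^2]^{1-\pi}} \geq \exp(3cn^{-\kappa}).$

Moreover, the same inequality also holds for the jth covariates in $\check{\mathbf{z}}$, $\check{\mathbf{z}}^{(1)}$ and
$\check{\mathbf{z}}^{(2)}$ when $j \in \mathcal{A}_2$ with $\mathcal{A}_2$ defined in (10).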


Condition 4 ($K_p$-sparsity). For each k = 1, 2, the precision matrix $\Omega_k$
is $K_p$-sparse, where a matrix is said to be $K_p$-sparse if each of its rows has
at most $K_p$ nonzero components with $K_p$ a positive integer depending only
on p. Moreover, $\|\Omega_k\|_{\max}$ is bounded from above by some positive constant
independent of p, where $\|\cdot\|_{\max}$ is the elementwise infinity norm of a matrix.


Condition 1 is used to control the tail behavior of the covariates. The Gaussian
distribution and distributions with bounded support are two special
examples of sub-Gaussian distributions.

Condition 2 imposes conditions on the eigenvalues of the population covariance
matrices $\Sigma_1$ and $\Sigma_2$. The lower bound $\tau_1$ is a constant while the
upper bound $\tau_{2,p}$ can slowly diverge to infinity with p. So the condition numbers
of $\Sigma_1$ and $\Sigma_2$ can diverge with p as well. We remark that we need
a constant lower bound $\tau_1$ to exclude the case of perfect or nearly perfect
collinearity of features at the population level. On the technical side, the
constant lower bound $\tau_1$ ensures that the transformed feature vectors are still
sub-Gaussian after transformation.

Condition 3 is a signal strength condition which assumes that for any $j \in
\mathcal{A}_1$, the population variances of the jth transformed feature $\tilde{Z}_j$ are different
enough across classes, by noting that

(14)  $D_j = \log\sigma_j^2 - \sum_{k=1}^{2}\pi_k\log[(\sigma_j^{(k)})^2] \geq 3cn^{-\kappa}$

with $\pi_1 = \pi$ and $\pi_2 = 1 - \pi$ when $j \in \mathcal{A}_1$. Meanwhile, it is clear from the
definition of $\mathcal{A}_1$ that $D_j$ is exactly 0 when $j \in \mathcal{A}_1^c$ since the population
variances of the jth transformed covariate $\tilde{Z}_j$ are the same across classes.
The same results hold for any feature with index in $\mathcal{A}_2$, after transforming
the data using $\Omega_2$, based on the second part of this condition.

Condition 4 is on the sparsity of the precision matrices, which is needed
for ensuring the estimation accuracy of precision matrices. The same family
of precision matrices has also been considered in [9] for high-dimensional
linear classification. Condition 4 also imposes a uniform upper bound for all
components of $\Omega_k$. We note that we use this assumption merely to simplify
the proof, and our main results will still hold with a slightly more complicated
form when the upper bound diverges slowly with the number of
predictors p.

3.2. Oracle-assisted IIS. In this subsection, we consider IIS with known
precision matrices, which we call the oracle-assisted IIS. The case of unknown
precision matrices will be studied in the next subsection. The results
developed here are mainly of theoretical interest and will serve as a benchmark
for the performance of IIS with unknown precision matrices.

As introduced in Section 2, IIS works with the transformed feature vectors
$\tilde{\mathbf{z}} = \Omega_1\mathbf{z}$ and $\check{\mathbf{z}} = \Omega_2\mathbf{z}$ identically. For ease of presentation we only discuss
in detail IIS based on the transformation $\tilde{\mathbf{z}} = (\tilde{Z}_1, \ldots, \tilde{Z}_p)^T = \Omega_1\mathbf{z}$.

Suppose we observe n data points $\{(\mathbf{z}_i^T, \Delta_i), i = 1, \ldots, n\}$, $n_k$ of which are
from class k for k = 1, 2. Write $\tilde{\mathbf{Z}} = \mathbf{Z}\Omega_1$ as the transformed data matrix,
where $\mathbf{Z} = (\mathbf{z}_1, \ldots, \mathbf{z}_n)^T$ is the original data matrix. To test whether the jth
transformed feature $\tilde{Z}_j$ has different variances across two classes, we propose
to use the following test statistic introduced in [16]:

(15)  $\tilde{D}_j = \log\tilde{\sigma}_j^2 - \sum_{k=1}^{2}(n_k/n)\log[(\tilde{\sigma}_j^{(k)})^2],$




FAN, KONG, LI AND ZHENG 


where $\tilde{\sigma}_j^2$ denotes the pooled sample variance estimate for $\tilde{Z}_j$, and $(\tilde{\sigma}_j^{(k)})^2$
is the within-class sample variance estimate for $\tilde{Z}_j$ in class k. As can be
seen from (15), $\tilde{D}_j$ is expected to be nonzero if variances of $\tilde{Z}_j$ are different
across classes. This test statistic was originally introduced in [16] in the sliced
inverse index model setting for detecting important variables with pairwise
or higher-order interactions among p predictors. The aforementioned paper
recommends the use of $\tilde{D}_j$ in the initial screening step of their proposed
procedure, and proves its sure screening property under some regularity
conditions.
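The oracle-assisted statistic can be sketched in a few lines. The function below is a minimal illustration, assuming that the pooled variance is the sample variance of all n transformed observations with class labels ignored and that all variances use the divisor n or n_k; these normalizations, the function name `iis_statistics` and the simulated toy model are assumptions of this sketch and are not taken from the paper.

```python
import numpy as np


def iis_statistics(Z, labels, Omega):
    """Statistic of the form log(pooled var) - sum_k (n_k/n) log(class-k var) on Z @ Omega."""
    Z_tilde = Z @ Omega                       # innovated transform of the data matrix
    n = Z_tilde.shape[0]
    D = np.log(Z_tilde.var(axis=0))           # variance of all observations, labels ignored
    for k in np.unique(labels):
        Z_k = Z_tilde[labels == k]
        D -= (Z_k.shape[0] / n) * np.log(Z_k.var(axis=0))
    return D


# Toy model (hypothetical): Omega_1 = I and three interacting features in class 2.
rng = np.random.default_rng(0)
p, n_k = 20, 2000
active = [2, 5, 8]                            # 0-based indices of interaction variables
Omega_1 = np.eye(p)
Omega_2 = np.eye(p)
for a in active:
    for b in active:
        Omega_2[a, b] += -0.6 if a == b else -0.15
Sigma_2 = np.linalg.inv(Omega_2)

Z1 = rng.normal(size=(n_k, p))                            # class 1: N(0, I)
Z2 = rng.normal(size=(n_k, p)) @ np.linalg.cholesky(Sigma_2).T   # class 2: N(0, Sigma_2)
Z = np.vstack([Z1, Z2])
labels = np.repeat([1, 2], n_k)

D_tilde = iis_statistics(Z, labels, Omega_1)  # oracle transform with the known Omega_1
```

In this toy setting the statistics of the three interacting features are clearly separated from those of the remaining features, which stay close to zero, so a simple threshold between the two groups would retain exactly the interaction variables.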

Denote by $\tilde{\tau}_{1,p} = \min\{\pi\tau_{2,p}^{-1} + (1 - \pi)\tau_1\tau_{2,p}^{-2}, 1\}$ and
$\tilde{\tau}_{2,p} = \max\{\pi\tau_1^{-1} + (1 - \pi)\tau_1^{-2}\tau_{2,p} + \pi(1 - \pi)\tau_1^{-2}\|\mu_1\|_2^2, \exp(1)\}$. The following proposition shows
that the oracle-assisted IIS enjoys the sure screening property in interaction
selection under our model setting.

Proposition 1. Assume that Conditions 1-3 hold. If $\log p = O(n^{\gamma})$ with
$\gamma > 0$ and $\gamma + 2\kappa < 1$, and $\tilde{\tau}_{1,p}^{-2} + \log^2(\tilde{\tau}_{2,p}) = o(n^{1-2\kappa-\gamma})$, then with probability
at least $1 - \exp\{-Cn^{1-2\kappa}/[\tilde{\tau}_{1,p}^{-2} + \log^2(\tilde{\tau}_{2,p})]\}$ for some positive constant C,
it holds that

$\min_{j \in \mathcal{A}_1}\tilde{D}_j \geq 2cn^{-\kappa}$ and $\max_{j \in \mathcal{A}_1^c}\tilde{D}_j \leq cn^{-\kappa},$

for large enough n, where c is defined in Condition 3. The same results also
hold for the sets $\mathcal{A}_2$ and $\mathcal{A}_2^c$ with the test statistics being calculated using
data transformed by $\Omega_2$.

The assumption $\tilde{\tau}_{1,p}^{-2} + \log^2(\tilde{\tau}_{2,p}) = o(n^{1-2\kappa-\gamma})$ restricts how fast the upper
bound $\tau_{2,p}$ in Condition 2 can diverge with the number of predictors p.
Proposition 1 entails that the oracle-assisted IIS can identify all indices in
$\mathcal{A}_1 \cup \mathcal{A}_2$ with overwhelming probability, by thresholding the test statistics $\tilde{D}_j$
with threshold chosen in the interval $(cn^{-\kappa}, 2cn^{-\kappa})$. In view of (12), Proposition 1
gives the variable selection consistency of the oracle-assisted IIS; that
is, the set of true interaction variables $\mathcal{I}$ can be selected with asymptotic
probability one. This result holds for ultra-high dimensional p satisfying
$\log p = O(n^{\gamma})$ with $0 < \gamma < 1 - 2\kappa$. The key step in proving the proposition is
to analyze the deviation bound of $\tilde{D}_j$ from its population counterpart $D_j$.
More details can be found in the supplementary material [10].

3.3. IIS with unknown precision matrices. In most applications, the precision
matrices $\Omega_1$ and $\Omega_2$ are unknown and need to be estimated. There is
a large body of literature on estimating precision matrices. See, for example,
[1, 5, 12, 22, 28, 29, 31], among others. These methods share a common
assumption that the underlying true precision matrix is sparse. In this paper,
we focus on the family of $K_p$-sparse precision matrices as introduced in
Condition 4. For the estimation, we use the following class of estimators.

Definition 1 (Acceptable estimator). A $p \times p$ symmetric matrix $\hat{\Omega}$ is
an acceptable estimator of the $K_p$-sparse population precision matrix $\Omega$ if it
satisfies the following two conditions: (1) it is independent of the test data
and is $K_p'$-sparse with $K_p'$ a sequence of positive integers depending only on
p, and (2) it satisfies the entry-wise estimation error bound $\|\hat{\Omega} - \Omega\|_{\max} \leq
C_1K_p^2\sqrt{(\log p)/n}$ with some positive constant $C_1$.

The same class of estimators has been introduced and used in [9]. As
discussed in [9], many existing precision matrix estimators such as CLIME
[5] and Glasso [12] are acceptable under some regularity conditions. Other
methods for estimating precision matrices can also yield acceptable estimators
under certain conditions; see [9] for more discussions on acceptable
estimators.

For each k = 1, 2, given an acceptable estimator $\hat{\Omega}_k$ of $\Omega_k$, our IIS approach
transforms the data matrix as $\mathbf{Z}\hat{\Omega}_k$. Similar to the last subsection,
we only discuss in detail IIS based on the transformation $\mathbf{Z}\hat{\Omega}_1$. Then the
corresponding test statistic $\hat{D}_j$ is

(16)  $\hat{D}_j = \log\hat{\sigma}_j^2 - \sum_{k=1}^{2}(n_k/n)\log[(\hat{\sigma}_j^{(k)})^2],$

where $\hat{\sigma}_j^2$ is the pooled sample variance estimate for the jth feature after
the transformation $\mathbf{Z}\hat{\Omega}_1$, and $(\hat{\sigma}_j^{(k)})^2$ is the class k sample variance estimate
for the jth feature after the transformation for k = 1, 2.

With an acceptable estimate $\hat{\Omega}_1$ of $\Omega_1$, the transformed data matrix $\mathbf{Z}\hat{\Omega}_1$
is expected to be close to the data matrix $\mathbf{Z}\Omega_1$. Correspondingly, the test
statistics $\hat{D}_j$ are expected to be close to the test statistics $\tilde{D}_j$ defined in
(15), which ensures that the same selection consistency property discussed
in Proposition 1 is inherited by using test statistics $\hat{D}_j$. This result is formally
summarized below in Theorem 1.

Define $\hat{\mathcal{A}}_1 = \{1 \leq j \leq p : \hat{D}_j > \omega_n\}$ with $\omega_n > 0$ the threshold level depending
only on n. Let

$T_{n,p} = \tilde{C}_1\tilde{\tau}_{1,p}^{-1}\tau_{2,p}(K_p + K_p')K_p^3\sqrt{(\log p)/n}\,\max\{(K_p + K_p')K_p\sqrt{(\log p)/n}, 1\},$

where $\tilde{C}_1$ is some positive constant, and $\tilde{\tau}_{1,p}$ and $\tau_{2,p}$ are the same as in
Proposition 1.

Theorem 1. Assume that the conditions in Proposition 1 are satisfied
and that for each k = 1, 2, $\hat{\Omega}_k$ is an acceptable estimator of the true precision
matrix $\Omega_k$. In addition, assume that Condition 4 is satisfied and $T_{n,p} =
o(n^{-\kappa})$. Then with probability at least $1 - \exp\{-Cn^{1-2\kappa}/[\tilde{\tau}_{1,p}^{-2} + \log^2(\tilde{\tau}_{2,p})]\}$
for some positive constant C, it holds that

$\hat{\mathcal{A}}_1 = \mathcal{A}_1$ with $\omega_n \in (\alpha_n - \beta_n, \alpha_n)$

for large enough n, where $\alpha_n = 2cn^{-\kappa} - T_{n,p}$ and $\beta_n = cn^{-\kappa} - 2T_{n,p}$ with
c defined in Condition 3. The same result holds for sets $\mathcal{A}_2$ and $\hat{\mathcal{A}}_2$ with
$\hat{\mathcal{A}}_2$ defined analogously to $\hat{\mathcal{A}}_1$ using the test statistics calculated with data
transformed by $\hat{\Omega}_2$.

As shown in the proof of Theorem 1 in Section 7, it holds that

$\min_{j \in \mathcal{A}_1}\hat{D}_j \geq \alpha_n$ and $\max_{j \in \mathcal{A}_1^c}\hat{D}_j \leq \alpha_n - \beta_n,$

with asymptotic probability one. The term $\beta_n$ measures how different the
test statistics are in and outside of the set $\mathcal{A}_1 \cup \mathcal{A}_2$. Thus, by thresholding the
test statistics $\hat{D}_j$ with an appropriately selected threshold level, the index set
$\mathcal{A}_1 \cup \mathcal{A}_2$ can be identified with asymptotic probability one, and consequently,
our IIS method enjoys the variable selection consistency as described in
Theorem 1. We will discuss the implementation of IIS with test statistics
(16) in detail in Section 5.
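The thresholding step with estimated precision matrices can be sketched as follows. This is only an illustration under strong simplifications: the inverse of each class sample covariance matrix is used as a stand-in for an acceptable estimator such as CLIME or Glasso (it requires n_k > p and is computed from the same data, so it does not satisfy the independence requirement of Definition 1), and the threshold value 0.02 is a hypothetical choice for this toy model rather than a recommended default.

```python
import numpy as np


def variance_statistic(Z_t, labels):
    """log(pooled var) - sum_k (n_k/n) log(class-k var), column by column."""
    n = Z_t.shape[0]
    D = np.log(Z_t.var(axis=0))
    for k in np.unique(labels):
        Z_k = Z_t[labels == k]
        D -= (Z_k.shape[0] / n) * np.log(Z_k.var(axis=0))
    return D


def iis_screen(Z, labels, omega_n):
    """Union of the indices whose statistics exceed omega_n under either transform."""
    selected = set()
    for k in (1, 2):
        # Stand-in estimator (assumption): inverse sample covariance of class k.
        Omega_hat = np.linalg.inv(np.cov(Z[labels == k], rowvar=False))
        D_hat = variance_statistic(Z @ Omega_hat, labels)
        selected |= set(np.flatnonzero(D_hat > omega_n).tolist())
    return selected


# Toy data (hypothetical): identity precision in class 1, three interacting features in class 2.
rng = np.random.default_rng(0)
p, n_k = 20, 2000
active = [2, 5, 8]
Omega_2 = np.eye(p)
for a in active:
    for b in active:
        Omega_2[a, b] += -0.6 if a == b else -0.15
L = np.linalg.cholesky(np.linalg.inv(Omega_2))
Z = np.vstack([rng.normal(size=(n_k, p)), rng.normal(size=(n_k, p)) @ L.T])
labels = np.repeat([1, 2], n_k)

I_hat = iis_screen(Z, labels, omega_n=0.02)
print(sorted(I_hat))
```

With high-dimensional data the sample covariance is singular, so a sparse precision matrix estimator fitted on a separate part of the data would have to replace the plain inverse used here.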

Compared to Proposition 1, the lower bound of the test statistics over $\mathcal{A}_1$,
which is given by $\alpha_n$, is smaller than the one in Proposition 1, reflecting the
sacrifice caused by estimating precision matrices. The additional assumption
on $T_{n,p}$ is related to the sparsity level and estimation errors of precision
matrices. Under these two assumptions, $\alpha_n$ and $\beta_n$ are close to $2cn^{-\kappa}$ and
$cn^{-\kappa}$, the bounds given in Proposition 1, respectively, implying a relatively
small price paid in estimating precision matrices.

4. Post-screening variable selection. Denote by X = .Ai U .4.2 the index 
set identihed by the IIS approach. Let d = \X\ be its cardinality. Then the 
variable selection consistency of IIS guarantees that T is the true set of in¬ 
teraction variables X with asymptotic probability one. By the sparsity of fi 
assumed in Section 2, the cardinality d is equal to q = o(min{n,p}) with over¬ 
whelming probability. With selected variables in X, interactions can be re¬ 
constructed as B = {ZjZi, for all j,i G X}, which indicates that IIS reduces 
the dimensionality of interactions from 0{p^) to less than o(min{n^,p^}) 
with overwhelming probability. Important questions are how to further se¬ 
lect active interactions and how to conduct classihcation using these selected 
interactions. 
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In the classification literature, variable selection techniques have been 
frequently used to construct high-dimensional classihers, for example, the 
penalized logistic regression [13, 32], the LPD rule [4], and the DSDA ap¬ 
proach [20], among many others. In this paper, we use the idea of penalized 
logistic regression to further select important main effects and interactions. 
Before going into details, we first introduce some notation. For a feature vector
$z = (Z_1,\dots,Z_p)^T$, let $x = (1, Z_1,\dots,Z_p, Z_1^2, Z_1Z_2,\dots,Z_{p-1}Z_p, Z_p^2)^T$ be the
$\tilde p$-dimensional full augmented feature vector with $\tilde p = (p+1)(p+2)/2$. Assume
that the conditional probability of success $\pi(x) = P(\Delta = 1|x) = P(\Delta = 1|z)$
is linked to the feature vector $x$ by the following logistic regression model:

(17) $\operatorname{logit}(\pi(x)) = \log\frac{\pi(x)}{1-\pi(x)} = x^T\theta,$

where $\theta$ is the regression coefficient vector. Based on (17), a new observation
$z$ is classified into class 1 if and only if $x^T\theta > 0$. We remark that if both $z^{(1)}$
and $z^{(2)}$ are Gaussian distributed, the decision rule derived from the logistic
regression model (17) is identical to the Bayes rule (3), which is our main
reason for using penalized logistic regression for selecting important main
effects and interactions.
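
The remark about Gaussian classes can be checked with a short calculation. The following is a sketch under assumed notation, not a restatement of rule (3): the class-$k$ feature vector is taken to be $N(\mu_k, \Sigma_k)$ with $\Omega_k = \Sigma_k^{-1}$, $f_k$ is its density, $\pi = P(\Delta = 1)$, and class 2 is coded as $\Delta = 0$. The sign and centering conventions of (3) may differ.

```latex
\begin{align*}
\log\frac{P(\Delta = 1 \mid z)}{P(\Delta = 0 \mid z)}
  &= \log\frac{\pi f_1(z)}{(1-\pi) f_2(z)} \\
  &= \tfrac12\, z^T(\Omega_2 - \Omega_1) z
     + (\Omega_1\mu_1 - \Omega_2\mu_2)^T z + c, \\
c &= \log\frac{\pi}{1-\pi}
     + \tfrac12 \log\frac{|\Omega_1|}{|\Omega_2|}
     - \tfrac12\,\mu_1^T\Omega_1\mu_1
     + \tfrac12\,\mu_2^T\Omega_2\mu_2 .
\end{align*}
```

The right-hand side is linear in the augmented vector $x$: the coefficients of the products $Z_jZ_\ell$ come from $\Omega_2 - \Omega_1$, those of the main effects $Z_j$ from $\Omega_1\mu_1 - \Omega_2\mu_2$, and $c$ plays the role of the intercept. Under these assumptions the log-odds therefore has exactly the form $x^T\theta$ required by (17).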

Write $\mathcal T \subset \{1,\dots,\tilde p\}$ for the set of indices formed by the intercept, all main
effects $Z_1,\dots,Z_p$ and interactions $Z_kZ_\ell$ with $k,\ell \in \hat{\mathcal I}$. If there is no
interaction screening and $\hat{\mathcal I} = \{1,\dots,p\}$, then $\mathcal T = \{1,\dots,\tilde p\}$, meaning that all
pairwise interactions are used in the post-screening variable selection step.
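
To make the bookkeeping concrete, here is a minimal Python sketch of the augmented feature vector and of the reduced index set. The function names `augment` and `reduced_index_set` are hypothetical, indices are 0-based, and the ordering of the interaction terms (all pairs $j \le \ell$ in lexicographic order) is an assumption consistent with the display above.

```python
from itertools import combinations_with_replacement


def augment(z):
    """Return x = (1, Z_1..Z_p, Z_1^2, Z_1 Z_2, ..., Z_{p-1} Z_p, Z_p^2)."""
    p = len(z)
    # All index pairs (j, l) with j <= l, in lexicographic order.
    pairs = combinations_with_replacement(range(p), 2)
    return [1.0] + [float(v) for v in z] + [float(z[j] * z[l]) for j, l in pairs]


def reduced_index_set(p, kept):
    """0-based positions of the augmented vector kept after screening.

    `kept` lists the 0-based indices of the retained interaction variables.
    The intercept and all p main effects are always kept.
    """
    kept = set(kept)
    idx = list(range(1 + p))  # intercept + all main effects
    for pos, (j, l) in enumerate(combinations_with_replacement(range(p), 2)):
        if j in kept and l in kept:
            idx.append(1 + p + pos)
    return idx


x = augment([2.0, 3.0, 5.0])
print(len(x))                      # equals (p + 1)(p + 2) / 2 for p = 3
T = reduced_index_set(p=3, kept=[0, 2])
print(len(T))                      # equals 1 + p + d(d + 1) / 2 for d = 2
```

With $d$ retained interaction variables the reduced set has $1 + p + d(d+1)/2$ elements, compared with $(p+1)(p+2)/2$ without screening.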

Denote by $X = (x_1,\dots,x_n)^T = (\tilde x_1,\dots,\tilde x_{\tilde p})$ the full augmented design
matrix with $x_i$ the full augmented feature vector for the $i$th observation $z_i$. In
order to estimate the regression coefficient vector $\theta$, we consider the reduced
feature space spanned by the $1 + p + d(d+1)/2$ columns of $X$ with indices
in $\mathcal T$ and estimate $\theta$ by solving the following regularization problem:

(18) $\hat\theta = \arg\min_{\theta \in \mathbb R^{\tilde p},\ \theta_{\mathcal T^c} = 0} \Big\{ n^{-1}\sum_{i=1}^n \ell(x_i^T\theta, \Delta_i) + \operatorname{pen}(\theta) \Big\},$

where $\mathcal T^c$ is the complement of $\mathcal T$, $\ell(x^T\theta, \Delta) = -\Delta(x^T\theta) + \log[1 + \exp(x^T\theta)]$
is the logistic loss function and $\operatorname{pen}(\theta)$ is some penalty function on the
parameter vector $\theta$. Various penalty functions have been proposed in the
literature for high-dimensional variable selection; see, for example, Lasso [24],
SCAD [8], SICA [19] and MCP [30], among many others. See also [11] for
the asymptotic equivalence of various regularization methods. Due to the
existence of interactions, the design matrix $X$ can have highly correlated
columns. To overcome the difficulty caused by potential high collinearity, in
our application we propose to use the elastic net penalty [33], which takes
the form $\operatorname{pen}(\theta) = \lambda_1\|\theta\|_1 + \lambda_2\|\theta\|_2^2$ with $\lambda_1$ and $\lambda_2$ two nonnegative
regularization parameters. Similar types of penalty functions have also been used
and studied in [3] and [15]. Note that solving the regularization problem (18)
in the reduced parameter space indexed by $\mathcal T$ is computationally more efficient than
solving it in the original $\tilde p$-dimensional parameter space.
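
As an illustration of problem (18), the sketch below minimizes the elastic-net penalized logistic loss over the coordinates in a given active set by proximal gradient descent. This is not the authors' implementation (the paper uses the R package glmnet): the solver, the step size, the toy data and all names (`fit`, `objective`, `ACTIVE`, `LAM1`, `LAM2`) are assumptions chosen for a self-contained example, and the intercept is penalized here exactly as (18) is written.

```python
import math


def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal map of t * |.|, used for the L1 part of the penalty."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def objective(X, y, theta, lam1, lam2):
    """n^{-1} sum of logistic losses plus lam1*||theta||_1 + lam2*||theta||_2^2."""
    loss = 0.0
    for xi, yi in zip(X, y):
        u = sum(a * b for a, b in zip(xi, theta))
        loss += -yi * u + max(u, 0.0) + math.log1p(math.exp(-abs(u)))
    pen = lam1 * sum(abs(t) for t in theta) + lam2 * sum(t * t for t in theta)
    return loss / len(X) + pen


def fit(X, y, active, lam1, lam2, step=0.1, iters=500):
    """Proximal gradient descent; coordinates outside `active` stay at zero."""
    n, m = len(X), len(X[0])
    theta = [0.0] * m
    for _ in range(iters):
        grad = [0.0] * m
        for xi, yi in zip(X, y):
            u = sum(xi[j] * theta[j] for j in active)
            r = sigmoid(u) - yi
            for j in active:
                grad[j] += r * xi[j] / n
        for j in active:
            g = grad[j] + 2.0 * lam2 * theta[j]   # gradient of the smooth part
            theta[j] = soft_threshold(theta[j] - step * g, step * lam1)
    return theta


# Toy data: rows are (1, Z1, Z2, Z1^2, Z1*Z2, Z2^2); the label depends on sign(Z1*Z2).
Z = [(-1, -1), (-1, 1), (1, -1), (1, 1), (0.5, 0.5), (-0.5, 0.5)]
X = [[1.0, a, b, a * a, a * b, b * b] for a, b in Z]
y = [1.0 if a * b > 0 else 0.0 for a, b in Z]
ACTIVE = [0, 1, 2, 4]          # hypothetical reduced index set: squares screened out
LAM1, LAM2 = 0.05, 0.01

theta_hat = fit(X, y, ACTIVE, LAM1, LAM2)
print(theta_hat)
```

The constraint $\theta_{\mathcal T^c} = 0$ is enforced simply by never updating the coordinates outside the active set, which is also why the reduced problem is cheaper: each iteration touches only the retained columns.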

Generally speaking, the post-screening variable selection is able to reduce 
the number of false positive interactions. Thus, only when there are inter¬ 
actions surviving both the screening step and variable selection step, sparse 
QDA will be used for classification; otherwise, sparse LDA will be used 
for classification. In this sense, our approach is adaptive and automatically 
chooses between sparse LDA and sparse QDA. 

4.1. Oracle inequalities. Denote by $S = \operatorname{supp}(\theta_0)$ the support of the true
regression coefficient vector $\theta_0$ and by $S^c$ its complement. Let $s = |S|$ be the
cardinality of the set $S$. For any $\delta = (\delta_1,\dots,\delta_{\tilde p})^T \in \mathbb R^{\tilde p}$, we use $\delta_S$ to denote
the subvector formed by the components $\delta_j$ with $j \in S$. The following
conditions are needed for establishing the oracle inequalities for $\hat\theta$ defined in
(18):

Condition 5. There exists some positive constant $0 < \pi_{\min} < 1/2$ such
that $\pi_{\min} < P(\Delta = 1|z) < 1 - \pi_{\min}$ for all $z$.

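Condition 6. There exists some constant $\phi > 0$ such that

(19) $\delta^T\tilde\Sigma\delta \ge \phi^2\delta_S^T\delta_S$

for any $\delta$ satisfying $\|\delta_{S^c}\|_1 \le 4(s^{1/2} + \lambda_1^{-1}\lambda_2\|\theta_0\|_2)\|\delta_S\|_2$, where $\tilde\Sigma =$ [...].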

Condition 5 is a mild condition which is commonly imposed in logistic
regression and ensures that the conditional variance of the response variable is
uniformly bounded away from zero. Condition 6 is inspired by the restricted
eigenvalue (RE) assumptions in [2], where it was introduced for establishing
the oracle inequalities for the lasso estimator [24] and the Dantzig selector
[6]. The set on which (19) holds in Condition 6 also involves $\lambda_1^{-1}\lambda_2\|\theta_0\|_2$,
which is needed to deal with the $L_2$ term in the elastic net penalty [33]. A
similar condition has been used in [15] for studying the oracle inequalities
for the smooth-Lasso and other $\ell_1 + \ell_2$ methods in ultrahigh-dimensional
linear regression models with deterministic design and no interactions. In
our setting, the logistic regression model with random design and the
existence of interactions add extra technical difficulties in establishing the oracle
inequalities.

Theorem 2. Assume that all conditions in Theorem 1 and Conditions
5-6 are satisfied. Moreover, assume that $\lambda_1 \ge c_0\sqrt{\log(p)/n}$ with some
positive constant $c_0$, $5s^{1/2} + 4\lambda_1^{-1}\lambda_2\|\theta_0\|_2 = O(n^{\xi/2})$ and $\log(p) = o(n^{1/2-2\xi})$
with some constant $0 \le \xi < 1/4$. Then with probability at least
$1 - \exp\{-Cn^{1-2\kappa}/[\tilde\tau_{1,p}^2 + \log^2(\tilde\tau_{2,p})]\} - O(p^{-c_1})$, it holds simultaneously that

$\|\hat\theta - \theta_0\|_1 \le 32C^{-1}\phi^{-2}(\lambda_1 s^{1/2} + \lambda_2\|\theta_0\|_2)^2/\lambda_1,$

$n^{-1/2}\|X(\hat\theta - \theta_0)\|_2 \le 4C^{-1}\phi^{-1}(\lambda_1 s^{1/2} + \lambda_2\|\theta_0\|_2),$

where $C$ is some positive constant. Moreover, the same results hold with
probability at least $1 - O(p^{-c_1})$ for the regularized estimator $\hat\theta$ without the
interaction screening step, that is, without the constraint $\theta_{\mathcal T^c} = 0$ in (18).

Theorem 2 presents the oracle inequalities for the regularized estimator
$\hat\theta$ defined in (18). It extends the oracle inequalities in Theorem 1 of [15]
from the linear model with deterministic design and no interactions to the
logistic regression model with random design and interactions. Dealing with
interactions and a large random design matrix needs more delicate analysis. It
is worth pointing out that the results in Theorem 2 also apply to the
regularized estimator with $d = p$, that is, the case without interaction screening.

4.2. Oracle inequality for misclassification rate. Recall that based on the
logistic regression model (17), the oracle classifier classifies a new observation
$z$ to class 1 if and only if $x^T\theta_0 > 0$, where $x$ is the $\tilde p$-dimensional augmented
feature vector corresponding to $z$. Thus the oracle misclassification rate is

$R = \pi R(2|1) + (1 - \pi)R(1|2),$

where $R(i|j)$ is the probability that a new observation from class $j$ is
misclassified to class $i$ based on the oracle classifier. As discussed in the last
subsection, the oracle classifier $x^T\theta_0$ is the Bayes rule if the feature vectors
$z^{(1)}$ and $z^{(2)}$ from classes 1 and 2 are both Gaussian.

Correspondingly, given the sample $\{(z_i^T, \Delta_i)\}_{i=1}^n$, the misclassification rate
of the plug-in classifier $x^T\hat\theta$ with $\hat\theta$ defined in (18) takes the following form:

$R_n = \pi R_n(2|1) + (1 - \pi)R_n(1|2),$

where $R_n(i|j)$ is the probability that a new observation from class $j$ is
misclassified to class $i$ by the plug-in classifier.

We introduce some notation before stating our theoretical result on
misclassification rate. Denote by $F_1(x)$ and $F_2(x)$ the cumulative distribution
functions of the oracle classifier $x^T\theta_0$ under classes 1 and 2, respectively.
Let

$r_n = \max\Big\{ \sup_{x \in [-\varepsilon_0, \varepsilon_0]} |F_1'(x)|,\ \sup_{x \in [-\varepsilon_0, \varepsilon_0]} |F_2'(x)| \Big\},$

where $\varepsilon_0$ is a small positive constant, and $F_1'(x)$ and $F_2'(x)$ are the first-order
derivatives of $F_1(x)$ and $F_2(x)$, respectively.

Condition 7. Let $\Lambda_n = \log(p)(\lambda_1 s^{1/2} + \lambda_2\|\theta_0\|_2)^2/\lambda_1$. It holds that $\Lambda_n =
o(1)$ and $r_n\Lambda_n = o(1)$.

Theorem 3. Assume that all conditions in Theorem 2 and Condition 7
are satisfied. Then with probability at least
$1 - \exp\{-Cn^{1-2\kappa}/[\tilde\tau_{1,p}^2 + \log^2(\tilde\tau_{2,p})]\} - O(p^{-c_1})$, we have

(20) $0 \le R_n \le R + O(p^{-c_1}) + O(r_n\Lambda_n)$

for all sufficiently large $n$, where $c_1$ is some positive constant. Moreover, the
same inequality holds with probability at least $1 - O(p^{-c_1})$ for the plug-in
classifier based on the regularization estimator $\hat\theta$ without interaction
screening.

Theorem 3 ensures that with overwhelming probability, the
misclassification rate of the plug-in classifier is at most $O(p^{-c_1}) + O(r_n\Lambda_n)$ worse
than that of the oracle classifier. If $r_n$ is upper-bounded by some
constant, $\lambda_1 = O(\sqrt{(\log p)/n})$, and $\lambda_2\|\theta_0\|_2 = O(s^{1/2}\lambda_1)$, then (20) becomes
$0 \le R_n \le R + O(p^{-c_1}) + O(s(\log p)^{3/2}n^{-1/2})$. In the setting of two-class
Gaussian classification, the misclassification rate $R_n$ can also be lower bounded
by $R$, by noting that the oracle classifier $x^T\theta_0$ is the Bayes rule. Thus the
plug-in classifier is consistent. This result is formally summarized in the
following corollary.

Corollary 1. Assume that both $z^{(1)}$ and $z^{(2)}$ are Gaussian distributed.
Then under the same conditions as in Theorem 3, with probability at least
$1 - \exp\{-Cn^{1-2\kappa}/[\tilde\tau_{1,p}^2 + \log^2(\tilde\tau_{2,p})]\} - O(p^{-c_1})$, it holds that

$R \le R_n \le R + O(p^{-c_1}) + O(r_n\Lambda_n).$
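
The simplified rate quoted after Theorem 3 can be verified in a few lines. The calculation below is a sketch that reads Condition 7 as $\Lambda_n = \log(p)(\lambda_1 s^{1/2} + \lambda_2\|\theta_0\|_2)^2/\lambda_1$ and uses only the three rate assumptions stated in the text ($r_n = O(1)$, $\lambda_1 = O(\sqrt{(\log p)/n})$, $\lambda_2\|\theta_0\|_2 = O(s^{1/2}\lambda_1)$).

```latex
\begin{align*}
\lambda_1 s^{1/2} + \lambda_2\|\theta_0\|_2
  &= O\big(s^{1/2}\lambda_1\big), \\
\Lambda_n
  = \frac{\log(p)\,\big(\lambda_1 s^{1/2} + \lambda_2\|\theta_0\|_2\big)^2}{\lambda_1}
  &= O\big(s\,\lambda_1 \log p\big)
   = O\big(s\,(\log p)^{3/2}\, n^{-1/2}\big), \\
O(r_n\Lambda_n)
  &= O\big(s\,(\log p)^{3/2}\, n^{-1/2}\big)
  \qquad \text{when } r_n = O(1).
\end{align*}
```

In particular, the excess misclassification rate vanishes whenever $s(\log p)^{3/2} = o(n^{1/2})$.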

5. Numerical studies. 

5.1. Implementation. We apply the SIRI method in [16] to implement
IIS in our proposal. See Section 5.2 in [16] for more details on how to choose
thresholds in SIRI. The R code for SIRI is available at
http://www.people.fas.harvard.edu/~junliu/SIRI/.

It is worth mentioning that, as recommended in [16], SIRI is implemented
as an iterative stepwise procedure. That is, the next active interaction
variable is chosen based on the current set of interaction variables rather than
using a one-time hard-thresholding to select all interaction variables. The
iterative stepwise procedure is more stable in practice. Jiang and Liu [16]
established desirable properties of the SIRI method in selecting interaction
variables in the sliced inverse index model setting. We remark that the same
theoretical results hold under our model setting as long as an extra condition similar
to the stepwise detectable condition in [16] is imposed on the population
variances. Since the proofs are very similar to the ones in [16], to save space,
we do not formally state the results here. Instead, we refer the readers to
[16] for more details.

In the second stage of our proposal, we employ the R package glmnet for
variable selection. A refitting step after selection is added when calculating
classification error. For ease of presentation, our two-stage procedure is
referred to as IIS-SQDA.

For comparisons, we also include LDA, QDA, penalized logistic regres¬ 
sion (PLR), DSDA and the oracle procedure (Oracle). The LDA and QDA 
methods are implemented by directly plugging in the sample estimates of 
the unknown parameters. The oracle procedure uses the information of the 
true underlying sparse model and is thus a low-dimensional QDA. For PLR, 
we consider two different versions, PLR and PLR2, where the former uses 
main effects only and the latter includes additionally all possible pairwise 
interactions. For a fair comparison, a refitting step is also conducted for PLR
and PLR2, as we do for IIS-SQDA.

5.2. Simulation studies. We conducted two simulation studies to evaluate
the performance of IIS-SQDA. The class 1 distribution is chosen to be
$N(\mu_1, \Sigma_1)$ with $\mu_1 = \Sigma_1\delta$ and $\Sigma_1 = \Omega_1^{-1}$, and the class 2 distribution is chosen to
be $N(0, \Sigma_2)$ with $\Sigma_2 = \Omega_2^{-1}$, where $\Omega_1$, $\Omega_2$ and $\delta$ will be specified later.

5.2.1. Study 1. We demonstrate the performance of the oracle-assisted 
IIS approach and examine the resulting classification and variable selection 
performance. The results presented here can be used as a benchmark for 
evaluating the performance of IIS with unknown precision matrices. We 
consider the following setting for $\delta$ and precision matrices $\Omega_1$ and $\Omega_2$:

Model 1: $(\Omega_1)_{ij} =$ [...], $\Omega_2 = \Omega_1 + \Omega$, where $\Omega$ is a symmetric and
sparse matrix with $\Omega_{5,5} = \Omega_{25,25} = \Omega_{45,45} = -0.29$ and $\Omega_{5,25} = \Omega_{5,45} =
\Omega_{25,45} = -0.15$. The other 3 nonzero entries in the lower triangle of $\Omega$ are
determined by symmetry. $\delta = (0.6, 0.8, 0, \dots, 0)^T$. The dimension $p$ is 2000.

Thus there are two main effects and six interaction terms under our broad
definition of interaction in the Bayes rule (3).

We use two performance measures, false positive (FP) and false negative
(FN), to evaluate the screening performance of IIS. FP is defined as the
number of irrelevant interaction variables falsely kept while FN is defined as
the number of true interaction variables falsely excluded by IIS. An effective
variable screening procedure is expected to have the value of FP reasonably
small and the value of FN close to zero. The former implies that the variable
screening procedure can effectively reduce the dimensionality whereas the
latter implies that the sure screening property holds. The means and
standard errors (in parentheses) of FP and FN for interaction variables based on
100 replications are 0.63 (0.08) and 0.14 (0.03), respectively, in the
screening step. This demonstrates the fine performance of our IIS approach in
selecting interaction variables.

Table 2
The means and standard errors (in parentheses) of various performance measures by
different classification methods for study 1 based on 100 replications

Measure     PLR             DSDA            IIS-SQDA        Oracle
MR (%)      40.59 (0.40)    38.04 (0.35)    15.09 (0.40)    12.07 (0.06)
FP.main     25.69 (5.45)    22.80 (5.10)    2.63 (0.66)     0 (0)
FP.inter    -               -               0.62 (0.12)     0 (0)
FN.main     1.80 (0.04)     0.82 (0.05)     1.22 (0.05)     0 (0)
FN.inter    -               -               0.47 (0.10)     0 (0)

We further investigate the classification and variable selection perfor¬ 
mance of our proposal. Five performance measures are employed to summa¬ 
rize the results. The first measure is the misclassification rate (MR), which is 
calculated as the proportion of observations in an independently simulated 
test set of size 10,000 being allocated to the incorrect class. The second and 
third are FP.main and FP.inter, which represent the numbers of irrelevant 
main effects and irrelevant interaction effects falsely included in the classi¬ 
fication rule, respectively. The fourth and fifth are FN.main and FN.inter, 
which represent the numbers of relevant main effects and relevant interac¬ 
tion effects falsely excluded in the classification rule, respectively. Note that 
the definitions of FP.inter and FN.inter here are different from those screen¬ 
ing performance measures FP and FN, which are defined earlier. In fact, 
FP.inter and FN.inter are defined with respect to the number of interaction 
effects whereas screening performance measures FP and FN are defined with 
respect to the number of interaction variables. 
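
The performance measures above are simple counts. A minimal Python sketch follows; the helper names and the toy inputs are hypothetical, and the only assumption is that selected and true items are represented as hashable identifiers (an index for a variable, a sorted index pair for an interaction effect).

```python
def misclassification_rate(y_true, y_pred):
    """Proportion of test observations allocated to the incorrect class."""
    wrong = sum(1 for a, b in zip(y_true, y_pred) if a != b)
    return wrong / len(y_true)


def false_pos_neg(selected, truth):
    """Return (FP, FN): items falsely kept and true items falsely excluded."""
    selected, truth = set(selected), set(truth)
    return len(selected - truth), len(truth - selected)


# Screening measures count interaction VARIABLES ...
fp, fn = false_pos_neg(selected=[5, 25, 60], truth=[5, 25, 45])
print(fp, fn)

# ... whereas FP.inter / FN.inter count interaction EFFECTS (index pairs).
fp_inter, fn_inter = false_pos_neg(
    selected=[(5, 5), (5, 25), (25, 60)],
    truth=[(5, 5), (5, 25), (5, 45), (25, 25), (25, 45), (45, 45)],
)
print(fp_inter, fn_inter)

print(misclassification_rate([1, 1, 0, 0], [1, 0, 0, 0]))
```

Using the same helper for variables and for index pairs mirrors the distinction drawn in the text between the screening measures FP and FN and the selection measures FP.inter and FN.inter.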

The variable selection and classification results for different methods are
reported in Table 2. PLR2 is not computationally efficient in this case due to
the huge number of two-way interactions. The conventional LDA and QDA
are not applicable as $n_1 = n_2 = 100 < p$. So we only compare the variable
selection and classification performance of our proposal, IIS-SQDA, with
DSDA, PLR and the Oracle. It is clear that IIS-SQDA has better
classification performance than PLR and DSDA.

5.2.2. Study 2. In this study, we evaluate the performance of the IIS
approach with the estimated precision matrices and examine the resulting
classification and variable selection performance. We consider the following
four different model settings for precision matrices:

Model 2: $\Omega_1 = I_p$, $\Omega_2 = \Omega_1 + \Omega$, where $\Omega$ is a symmetric and sparse
matrix with $\Omega_{10,10} = \Omega_{30,30} = \Omega_{50,50} = -0.6$ and $\Omega_{10,30} = \Omega_{10,50} = \Omega_{30,50} =
-0.15$. The other 3 nonzero entries in the lower triangle of $\Omega$ are determined
by symmetry. $\delta = (0.6, 0.8, 0, \dots, 0)^T$.

Model 3: $\Omega_1$ is a band matrix with $(\Omega_1)_{ii} = 1$ for $i = 1,\dots,p$ and $(\Omega_1)_{ij} =
0.3$ for $|i - j| = 1$. $\Omega_2 = \Omega_1 + \Omega$, where $\Omega$ is a symmetric and sparse matrix
with $\Omega_{10,10} = -0.3785$, $\Omega_{10,30} = 0.0616$, $\Omega_{10,50} = 0.2037$, $\Omega_{30,30} = -0.5482$,
$\Omega_{30,50} = 0.0286$ and $\Omega_{50,50} = -0.4614$. The other 3 nonzero entries in the
lower triangle of $\Omega$ are determined by symmetry. $\delta = (0.6, 0.8, 0, \dots, 0)^T$.

Model 4: Similar to model 1 in the last subsection, except for the
dimension $p$.

Model 5: $\Omega_1$ is a block diagonal matrix comprised of equal blocks $A$,
where $A$ is a 2-by-2 matrix with diagonal elements equal to 1 and
off-diagonal elements equal to 0.4. $\Omega_2 = \Omega_1 + \Omega$, where $\Omega$ is a symmetric and
sparse matrix with $\Omega_{3,3} = \Omega_{6,6} = \Omega_{9,9} = \Omega_{12,12} = -0.2$, $\Omega_{3,6} = \Omega_{9,12} = 0.4$
and $\Omega_{3,9} = \Omega_{3,12} = \Omega_{6,9} = \Omega_{6,12} = -0.4$. The other 6 nonzero entries in the
lower triangle of $\Omega$ are determined by symmetry. The nonzero elements of $\delta$
are located at coordinates 3, 6, 9 and 12. The corresponding values for these
nonzero elements are simulated from a uniform distribution over [0.3, 0.7]
and remain unchanged during simulations.

For each model, we consider three different dimensionalities, $p = 50$, $p = 200$
and $p = 500$. There are two main effects and six interaction terms (including
quadratic terms) in the Bayes rules for models 2-4, and four main effects and ten
interaction terms in the Bayes rules for model 5. In models 2-4 no interaction
variables are main effect variables whereas in model 5 all interaction variables
are also main effect variables.
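
The counts of main effects and interaction terms quoted above follow directly from the supports of $\delta$ and $\Omega$. The sketch below recomputes them for models 2 and 5, under two assumptions: the Model 5 index pairs are as we read them from the model description, and main effects are identified with the nonzero coordinates of $\delta$ (which holds when $\mu_1 = \Sigma_1\delta$ and $\mu_2 = 0$). The function name `count_effects` is hypothetical.

```python
def count_effects(delta_support, omega_upper):
    """Count main effects, interaction terms and interaction variables.

    delta_support: 1-based coordinates where delta is nonzero.
    omega_upper:   dict {(j, l): value} of nonzero entries of Omega with j <= l
                   (the lower triangle is implied by symmetry).
    """
    nonzero_pairs = [pair for pair, v in omega_upper.items() if v != 0.0]
    variables = sorted({k for pair in nonzero_pairs for k in pair})
    return len(set(delta_support)), len(nonzero_pairs), variables


model2 = {(10, 10): -0.6, (30, 30): -0.6, (50, 50): -0.6,
          (10, 30): -0.15, (10, 50): -0.15, (30, 50): -0.15}
model5 = {(3, 3): -0.2, (6, 6): -0.2, (9, 9): -0.2, (12, 12): -0.2,
          (3, 6): 0.4, (9, 12): 0.4,
          (3, 9): -0.4, (3, 12): -0.4, (6, 9): -0.4, (6, 12): -0.4}

m2 = count_effects([1, 2], model2)
m5 = count_effects([3, 6, 9, 12], model5)
print(m2)
print(m5)

# Overlap between interaction variables and main effect variables.
print(set(m2[2]) & {1, 2})
print(set(m5[2]) == {3, 6, 9, 12})
```

The overlap check at the end reproduces the last remark of the paragraph: in model 2 the interaction variables and main effect variables are disjoint, while in model 5 they coincide.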

We use the same measures as in study 1 to examine the variable screening
performance of the IIS approach and the variable selection and classification
performance of IIS-SQDA. The means and standard errors (in parentheses)
of FP and FN for these models based on 100 replications are reported in
Table 3, which shows the effectiveness of our interaction screening approach.
For comparison purposes, we also include in Table 3 the screening results by
oracle-assisted IIS. It is interesting to observe that the IIS with estimated
precision matrices gives smaller FNs than and comparable FPs to the IIS
with true precision matrices.

Table 3
Interaction screening results for models 2-5. The numbers reported are the means and
standard errors (in parentheses) of FP and FN based on 100 replications

                   IIS with true $\Omega_1$ and $\Omega_2$      IIS with estimated $\Omega_1$ and $\Omega_2$
p      Model       FP             FN              FP             FN
50     Model 2     0.45 (0.08)    0.02 (0.01)     1.57 (0.15)    0.01 (0.01)
       Model 3     0.86 (0.09)    0.48 (0.06)     1.93 (0.15)    0.15 (0.04)
       Model 4     1.68 (0.13)    0.09 (0.03)     1.04 (0.11)    0.01 (0.01)
       Model 5     1.79 (0.16)    0.02 (0.02)     1.54 (0.13)    0.01 (0.01)
200    Model 2     0.43 (0.08)    0.04 (0.02)     1.16 (0.13)    0.02 (0.01)
       Model 3     0.74 (0.09)    0.48 (0.05)     1.03 (0.14)    0.15 (0.04)
       Model 4     1.52 (0.12)    0.08 (0.03)     0.44 (0.07)    0.03 (0.02)
       Model 5     1.10 (0.12)    0.36 (0.08)     0.90 (0.10)    0.04 (0.02)
500    Model 2     0.42 (0.07)    0.11 (0.03)     0.68 (0.09)    0.01 (0.01)
       Model 3     0.53 (0.06)    0.73 (0.07)     0.65 (0.09)    0.21 (0.04)
       Model 4     1.25 (0.12)    0.09 (0.03)     0.43 (0.07)    0.03 (0.02)
       Model 5     0.85 (0.10)    0.42 (0.09)     0.59 (0.09)    0.03 (0.02)

Tables 4-7 summarize the variable selection and classification results
based on 100 replications. We observe the following:

(1) IIS-SQDA exhibits the best performance in terms of MR and
interaction selection across all settings.

(2) PLR2 also has good classification accuracy in low-dimensional
situations ($p = 50$), but its interaction selection results are inferior to those of
IIS-SQDA in all settings.

(3) All linear classifiers have poor performance when the true
classification boundary is nonlinear.

(4) Comparing QDA with LDA shows that including all possible
interactions may not necessarily improve the classification performance. This is
not surprising because QDA has many more parameters to estimate than
LDA, while the sample size is very limited. Thus interaction selection is very
important, even with moderate dimensionality.

(5) Comparing the results of QDA with those of PLR2 or IIS-SQDA, we
observe that the classification performance can be improved substantially
by using interaction screening and selection. Particularly, in most cases, the
improvement becomes more significant as the dimensionality increases.

Another phenomenon we observed in simulation is that when the number
of predictors $p$ is as high as $p = 500$, PLR2 requires so much memory that
it easily causes memory overflow on a regular office PC with 8 GB of memory.

In addition, note that the misclassification rates of all methods in model 5 
are significantly higher than that of the Oracle classifier. We emphasize that 
it is due to the small true coefficients in the Bayes rule and the relatively 
complex true model. In fact, the setting of model 5 is so challenging that all 
other methods have close to or over 40% MR when p = 200 or 500. 

Table 4
The means and standard errors (in parentheses) of various performance measures by
different classification methods for model 2 based on 100 replications

p      Method      MR (%)          FP.main          FP.inter          FN.main        FN.inter
50     LDA         37.91 (0.13)    48 (0)           -                 0 (0)          -
       QDA         39.89 (0.11)    48 (0)           1269 (0)          0 (0)          0 (0)
       PLR         32.83 (0.23)    2.37 (0.49)      -                 1.09 (0.03)    -
       DSDA        32.70 (0.18)    4.61 (0.74)      -                 0.10 (0.03)    -
       PLR2        22.56 (0.33)    0.13 (0.05)      3.17 (0.70)       0.35 (0.05)    0.75 (0.09)
       IIS-SQDA    21.78 (0.22)    3.67 (0.67)      1.32 (0.23)       0.08 (0.03)    0.09 (0.04)
       Oracle      19.86 (0.08)    0 (0)            0 (0)             0 (0)          0 (0)
200    PLR         33.64 (0.31)    4.29 (1.34)      -                 1.09 (0.03)    -
       DSDA        33.33 (0.26)    10.83 (2.25)     -                 0.18 (0.04)    -
       PLR2        24.65 (0.51)    0.11 (0.05)      7.71 (2.27)       0.42 (0.06)    0.93 (0.09)
       IIS-SQDA    22.14 (0.30)    4.48 (0.91)      0.54 (0.11)       0.09 (0.03)    0.15 (0.05)
       Oracle      19.66 (0.06)    0 (0)            0 (0)             0 (0)          0 (0)
500    PLR         34.59 (0.39)    6.00 (1.46)      -                 1.12 (0.03)    -
       DSDA        33.87 (0.28)    14.76 (3.10)     -                 0.17 (0.04)    -
       PLR2        26.83 (0.58)    0.07 (0.04)      8.95 (2.02)       0.56 (0.06)    1.53 (0.11)
       IIS-SQDA    22.09 (0.30)    3.25 (1.02)      0.25 (0.08)       0.25 (0.05)    0.69 (0.09)
       Oracle      19.65 (0.06)    0 (0)            0 (0)             0 (0)          0 (0)

Table 5
The means and standard errors (in parentheses) of various performance measures by
different classification methods for model 3 based on 100 replications

p      Method      MR (%)          FP.main          FP.inter          FN.main        FN.inter
50     LDA         39.43 (0.15)    48 (0)           -                 0 (0)          -
       QDA         43.47 (0.10)    48 (0)           1269 (0)          0 (0)          0 (0)
       PLR         36.12 (0.26)    5.95 (0.93)      -                 1.21 (0.04)    -
       DSDA        35.05 (0.22)    8.81 (1.06)      -                 0.07 (0.03)    -
       PLR2        30.15 (0.44)    0.51 (0.14)      11.26 (2.78)      0.60 (0.05)    2.62 (0.09)
       IIS-SQDA    27.56 (0.27)    5.60 (0.82)      2.16 (0.32)       0.19 (0.04)    2.05 (0.09)
       Oracle      24.13 (0.07)    0 (0)            0 (0)             0 (0)          0 (0)
200    PLR         37.62 (0.34)    7.82 (1.87)      -                 1.47 (0.05)    -
       DSDA        36.34 (0.30)    15.06 (3.37)     -                 0.36 (0.05)    -
       PLR2        32.55 (0.53)    0.25 (0.06)      17.44 (3.63)      0.90 (0.05)    2.72 (0.08)
       IIS-SQDA    26.94 (0.31)    6.43 (1.24)      0.78 (0.17)       0.42 (0.05)    2.22 (0.08)
       Oracle      22.99 (0.07)    0 (0)            0 (0)             0 (0)          0 (0)
500    PLR         38.82 (0.33)    9.31 (1.99)      -                 1.58 (0.05)    -
       DSDA        37.10 (0.29)    16.06 (3.02)     -                 0.42 (0.05)    -
       PLR2        35.45 (0.64)    0.34 (0.09)      55.69 (12.67)     0.99 (0.05)    3.05 (0.10)
       IIS-SQDA    26.78 (0.31)    3.22 (1.09)      0.23 (0.05)       0.98 (0.02)    2.65 (0.09)
       Oracle      23.00 (0.08)    0 (0)            0 (0)             0 (0)          0 (0)

Table 6
The means and standard errors (in parentheses) of various performance measures by
different classification methods for model 4 based on 100 replications

p      Method      MR (%)          FP.main          FP.inter          FN.main        FN.inter
50     LDA         38.84 (0.16)    48 (0)           -                 0 (0)          -
       QDA         31.10 (0.16)    48 (0)           1269 (0)          0 (0)          0 (0)
       PLR         36.06 (0.24)    5.89 (0.78)      -                 1.39 (0.05)    -
       DSDA        35.36 (0.21)    10.41 (1.18)     -                 0.24 (0.04)    -
       PLR2        16.55 (0.40)    0.40 (0.08)      22.80 (1.72)      1.08 (0.06)    0.33 (0.06)
       IIS-SQDA    15.49 (0.33)    9.51 (1.34)      2.91 (0.38)       0.39 (0.05)    0.04 (0.03)
       Oracle      12.14 (0.06)    0 (0)            0 (0)             0 (0)          0 (0)
200    PLR         38.01 (0.30)    9.86 (2.04)      -                 1.64 (0.05)    -
       DSDA        36.39 (0.25)    13.98 (2.18)     -                 0.46 (0.05)    -
       PLR2        16.79 (0.48)    0.09 (0.03)      19.99 (1.76)      1.40 (0.05)    0.48 (0.08)
       IIS-SQDA    13.98 (0.28)    2.30 (0.72)      0.26 (0.09)       0.98 (0.05)    0.10 (0.05)
       Oracle      12.12 (0.07)    0 (0)            0 (0)             0 (0)          0 (0)
500    PLR         39.51 (0.35)    12.98 (2.13)     -                 1.72 (0.05)    -
       DSDA        37.90 (0.29)    24.04 (3.94)     -                 0.53 (0.05)    -
       PLR2        16.38 (0.52)    0.06 (0.02)      16.79 (1.36)      1.43 (0.05)    0.74 (0.10)
       IIS-SQDA    14.10 (0.28)    2.11 (0.57)      0.16 (0.07)       1.07 (0.05)    0.12 (0.06)
       Oracle      12.11 (0.06)    0 (0)            0 (0)             0 (0)          0 (0)

Table 7
The means and standard errors (in parentheses) of various performance measures by
different classification methods for model 5 based on 100 replications

p      Method      MR (%)          FP.main          FP.inter          FN.main        FN.inter
50     LDA         43.18 (0.14)    46 (0)           -                 0 (0)          -
       QDA         41.69 (0.12)    46 (0)           1265 (0)          0 (0)          0 (0)
       PLR         40.16 (0.26)    4.77 (0.73)      -                 1.93 (0.10)    -
       DSDA        38.89 (0.26)    7.98 (1.22)      -                 1.12 (0.10)    -
       PLR2        34.55 (0.39)    1.06 (0.22)      19.51 (3.53)      2.14 (0.11)    4.16 (0.13)
       IIS-SQDA    27.68 (0.23)    7.64 (0.86)      2.11 (0.28)       0.90 (0.09)    2.61 (0.18)
       Oracle      22.30 (0.10)    0 (0)            0 (0)             0 (0)          0 (0)
200    PLR         42.15 (0.32)    18.18 (3.10)     -                 2.22 (0.12)    -
       DSDA        39.22 (0.32)    16.23 (3.82)     -                 1.36 (0.11)    -
       PLR2        41.50 (0.38)    0.34 (0.08)      72.73 (10.99)     2.66 (0.10)    5.24 (0.14)
       IIS-SQDA    30.04 (0.32)    11.29 (1.83)     0.91 (0.18)       1.52 (0.10)    4.08 (0.17)
       Oracle      22.24 (0.08)    0 (0)            0 (0)             0 (0)          0 (0)
500    PLR         43.83 (0.32)    29.19 (4.70)     -                 2.36 (0.13)    -
       DSDA        40.03 (0.32)    20.54 (4.30)     -                 1.58 (0.10)    -
       PLR2        44.92 (0.32)    0.77 (0.13)      123.39 (15.77)    2.97 (0.09)    7.19 (0.15)
       IIS-SQDA    32.84 (0.32)    19.59 (3.32)     0.57 (0.10)       1.61 (0.12)    4.61 (0.18)
       Oracle      22.12 (0.07)    0 (0)            0 (0)             0 (0)          0 (0)

5.3. Real data analysis. We apply the same classification methods as
in Section 5.2 to the breast cancer data, originally studied in [26]. The
purpose of the study is to classify female breast cancer patients according
to relapse and nonrelapse clinical outcomes using gene expression data. The
total sample size is 78 with 44 patients in the good prognosis group and 34
patients in the poor prognosis group. One patient in the poor prognosis group
has some missing values and was therefore removed from the study here.
Thus $n_1 = 44$ and $n_2 = 33$. Our study uses the $p = 231$ genes reported in
[26].

Table 8
Misclassification rate and model size on the breast cancer data in [26] over 100 random
splits. Standard errors are in the parentheses

                               Model size
Method      MR (%)          Main            Interaction     All
DSDA        23.62 (0.74)    37.38 (1.57)    -               37.38 (1.57)
PLR         21.72 (0.78)    45.04 (1.35)    -               45.04 (1.35)
PLR2        40.47 (0.61)    14.87 (1.81)    19.95 (3.28)    34.82 (4.77)
IIS-SQDA    19.97 (0.77)    47.77 (1.16)    3.03 (0.32)     50.80 (1.31)

We randomly split the 77 samples into a training set and a test set such
that the training set consists of 26 samples from the good prognosis group
and 19 samples from the poor prognosis group. Correspondingly, the test
set has 18 samples from the good prognosis group and 14 samples from the
poor prognosis group. For each split, we applied four different methods, PLR,
PLR2, DSDA and IIS-SQDA, to the training data and then calculated the
classification error using the test data. The tuning parameters were selected
using cross-validation. We repeated the random splitting 100 times.
The means and standard errors of classification errors and model sizes for
different classification methods are summarized in Table 8. The average
number of genes contributing to the selected interactions over the 100 random
splits was 22.96 and 2.86 for PLR2 and IIS-SQDA, respectively. We can
observe that our proposed procedure has the best classification performance.
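
The splitting scheme described above is a stratified random split with fixed per-class training sizes. A minimal Python sketch is given below; the function name, the label strings and the seed are hypothetical, and no gene expression data are involved, only the index bookkeeping.

```python
import random


def stratified_split(labels, n_train_by_class, rng):
    """Split indices so each class contributes a fixed number of training samples."""
    train, test = [], []
    for cls, n_train in n_train_by_class.items():
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        train += idx[:n_train]
        test += idx[n_train:]
    return sorted(train), sorted(test)


# 44 good prognosis and 33 poor prognosis samples, as in the text.
labels = ["good"] * 44 + ["poor"] * 33
rng = random.Random(0)  # arbitrary seed, for reproducibility of the sketch only
splits = [stratified_split(labels, {"good": 26, "poor": 19}, rng)
          for _ in range(100)]

train, test = splits[0]
print(len(train), len(test))
```

Each of the 100 splits would then be passed to the classifiers under comparison, with tuning parameters chosen by cross-validation on the training indices only.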

6. Discussion. We have proposed a new two-stage procedure, IIS-SQDA, 
for two-class classification with possibly unequal covariance matrices in the 
high-dimensional setting. The proposed procedure first selects interaction 
variables and reconstructs interactions using these retained variables and 
then achieves main effects and interactions selection through regularization. 
The fine performance of IIS-SQDA has been demonstrated through theoret¬ 
ical study and numerical analyses. 

For future study, it would be interesting to extend the proposed
procedure to multi-class classification problems. In addition, IIS transforms the
data using the CLIME estimates of the precision matrices $\Omega_1$ and $\Omega_2$, which
can be slow to calculate when the number of predictors $p$ is very large. One
possible solution is to first reduce the dimensionality using some screening
method and then apply our IIS-SQDA for interaction screening and
classification. We are also in the process of developing a scalable version of the
IIS which significantly improves the computational efficiency.

7. Proofs of main theorems. In this section, we list main lemmas and 
present the proofs for main theorems. The secondary lemmas and additional 
technical proofs for all lemmas are provided in the supplementary material 
[10]. 

7.1. Lemmas. We introduce the following lemmas which are used in the 
proofs of Theorems 1-3. 

Lemma 1. Under model setting (2) and the conditions in Theorem 1,
for sufficiently large $n$, with probability at least $1 - p\exp(-C\tilde\tau_{1,p}^{-2}n^{1-2\kappa})$, it
holds that

$\max_{1 \le j \le p} |\tilde\sigma_j^2/\hat\sigma_j^2 - 1| \le T_{n,p}/6$

for some positive constant $C$, where $T_{n,p}$ is the same as in Theorem 1.

Lemma 2. Under Condition 5, we have

(21) $Cn^{-1}\|X\delta\|_2^2 + \operatorname{pen}(\hat\theta) \le \|n^{-1}\varepsilon^TX\|_\infty\|\delta\|_1 + \operatorname{pen}(\theta_0),$

where $C$ is some positive constant depending on the positive constant $\pi_{\min}$ in
Condition 5, $\delta = \hat\theta - \theta_0$ is the estimation error for the regularized estimator
$\hat\theta$ defined in (18) and $\varepsilon = y - E(y|X)$ with $y = (\Delta_1,\dots,\Delta_n)^T$.

Lemma 3. Assume that Condition 1 holds. If $\log(p) = o(n)$, then with
probability $1 - O(p^{-c_1})$, we have $\|n^{-1}\varepsilon^TX\|_\infty \le 2^{-1}c_0\sqrt{\log(p)/n}$, where $c_0$
is some positive constant and $\varepsilon = y - E(y|X)$ with $y = (\Delta_1,\dots,\Delta_n)^T$.

Lemma 4. Assume that Conditions 1 and 6 hold. If $5s^{1/2} + 4\lambda_1^{-1}\lambda_2\|\theta_0\|_2 =
O(n^{\xi/2})$ and $\log(p) = o(n^{1/2-2\xi})$ with constant $0 \le \xi < 1/4$, then
when $n$ is sufficiently large, with probability at least $1 - O(p^{-c_2})$, where $c_2$
is some positive constant, it holds that

$n^{-1/2}\|X\delta\|_2 \ge (\phi/2)\|\delta_S\|_2$

for any $\delta \in \mathbb R^{\tilde p}$ satisfying $\|\delta_{S^c}\|_1 \le 4(s^{1/2} + \lambda_1^{-1}\lambda_2\|\theta_0\|_2)\|\delta_S\|_2$.

Lemma 5. Assume that $w = (W_1,\dots,W_p)^T \in \mathbb R^p$ is sub-Gaussian. Then
for any positive constant $c_1$, there exists some positive constant $C_2$ such that

$P\Big\{ \max_{1 \le j \le p} |W_j| > C_2\sqrt{\log(p)} \Big\} = O(p^{-c_1}).$






7.2. Proof of Theorem 1. Since we have the inequality 
(22) \Dj — Dj\ < \Dj — Dj\ + \Dj — Dj\, 

the key of the proof is to show that with overwhelming probability, Dj and 
Dj are uniformly close as n ^ oo. Then together with Proposition 1, we can 
prove the desired result in Theorem 1. The same notation C will be used to 
denote a generic constant without los^of generality. 

We proceed to prove that Dj and Dj are uniformly close. By definitions 
of Dj and Dj, along with the fact that \nk/n\ < 1 for k = 1 and 2, we 
decompose the difference between Dj and Dj as 


(23) 


max 

i<i<p 


\Dj Dj\ 


< max I log (7^ — logd^ 


i<i<p 


+ max |log[(df^)^] - log[((TfV]|. 

^ \<j<p 


k=l 


The following argument is conditioning on the event, denoted by £i, such 
that the results hold in Lemma 1. Then a'j and ci| are uniformly close. Since 
log(l + Xn) —7> 1 as Xn —?> 0, it follows that 

log(d|/d|)/(<T|/d| - 1) ^ 1 

uniformly for all j as n —>■ oo. Thus, with a sufficiently large n uniformly 
over j, we have 

max |log(T| - logci|| > r„,p/3|£’i^ 

(24) 

< max |<7|/d-| — 1| > < pexp(—Cff 

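By a similar argument, we can derive for $k = 1, 2$,

$$P\Bigl(\max_{1\le j\le p} \bigl|\log[(\tilde{\sigma}_j^{(k)})^2] - \log[(\hat{\sigma}_j^{(k)})^2]\bigr| > T_{n,p}/3 \,\Big|\, \mathcal{E}_1\Bigr) \le p\exp(-C\tilde{\tau}_{1,p}^{-2}n^{1-2\kappa}).$$

In view of (23), we get

$$\begin{aligned}
P\Bigl(\max_{1\le j\le p} |\hat{D}_j - \tilde{D}_j| > T_{n,p} \,\Big|\, \mathcal{E}_1\Bigr)
&\le P\Bigl(\max_{1\le j\le p} |\log\tilde{\sigma}_j^2 - \log\hat{\sigma}_j^2| > T_{n,p}/3 \,\Big|\, \mathcal{E}_1\Bigr) \\
&\quad + \sum_{k=1}^{2} P\Bigl(\max_{1\le j\le p} \bigl|\log[(\tilde{\sigma}_j^{(k)})^2] - \log[(\hat{\sigma}_j^{(k)})^2]\bigr| > T_{n,p}/3 \,\Big|\, \mathcal{E}_1\Bigr) \\
&\le p\exp(-C\tilde{\tau}_{1,p}^{-2}n^{1-2\kappa}).
\end{aligned}$$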
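By Lemma 1, $P(\mathcal{E}_1^c) \le p\exp(-C\tilde{\tau}_{1,p}^{-2}n^{1-2\kappa})$. It follows that

$$\begin{aligned}
P\Bigl(\max_{1\le j\le p} |\hat{D}_j - \tilde{D}_j| > T_{n,p}\Bigr)
&\le P\Bigl(\max_{1\le j\le p} |\hat{D}_j - \tilde{D}_j| > T_{n,p} \,\Big|\, \mathcal{E}_1\Bigr) + P(\mathcal{E}_1^c) \\
&\le p\exp(-C\tilde{\tau}_{1,p}^{-2}n^{1-2\kappa}).
\end{aligned} \tag{25}$$

Therefore, for any $p$ satisfying $\log p = O(n^{\gamma})$ with $0 < \gamma < 1 - 2\kappa$ and [...], we get that for large enough $n$,

$$P\Bigl(\max_{1\le j\le p} |\hat{D}_j - \tilde{D}_j| > T_{n,p}\Bigr) \le \exp(-C\tilde{\tau}_{1,p}^{-2}n^{1-2\kappa}).$$

Since the same conditions hold for the matrices $\Sigma_2$ and $\Omega_2$, using similar arguments we can prove that the same results hold for the covariates in $\mathcal{A}_2$ with the test statistics calculated using the transformed data $Z\Omega_2$. This completes the proof of Theorem 1.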

7.3. Proof of Theorem 2. By Theorem 1, it is sufficient to show the second part of Theorem 2. The main idea of the proof is to first define an event which holds with high probability and then analyze the behavior of the regularized estimator $\hat{\theta}$ conditional on that event.

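Define $\varepsilon = y - E(y \mid X)$ with $y = (\Delta_1, \ldots, \Delta_n)^T$. Since $\log(p) = o(n)$, it follows from Condition 1 and Lemma 3 that for any $\lambda_1 \ge c_0\sqrt{\log(p)/n}$,

$$P\{\|n^{-1}\varepsilon^T X\|_\infty > 2^{-1}\lambda_1\} \le P\{\|n^{-1}\varepsilon^T X\|_\infty > 2^{-1}c_0\sqrt{\log(p)/n}\} = O(p^{-\tilde{c}_1}),$$

where $\tilde{c}_1$ is some positive constant. Meanwhile, from Lemma 4, under its assumptions on $5s^{1/2} + 4\lambda_1^{-1}\lambda_2\|\theta_0\|_2$ and $\log(p)$, we have $n^{-1/2}\|X\delta\|_2 \ge (\phi/2)\|\delta_S\|_2$ for any $\delta \in \mathbb{R}^{\tilde{p}}$ satisfying $\|\delta_{S^c}\|_1 \le 4(s^{1/2} + \lambda_1^{-1}\lambda_2\|\theta_0\|_2)\|\delta_S\|_2$ when $n$ is sufficiently large, with probability at least $1 - O(p^{-\tilde{c}_2})$. Combining these two results, we obtain that with probability at least $1 - O(p^{-\tilde{c}_1}) - O(p^{-\tilde{c}_2}) = 1 - O(p^{-c_1})$ with $c_1 = \min\{\tilde{c}_1, \tilde{c}_2\}$, it holds simultaneously that

$$\|n^{-1}\varepsilon^T X\|_\infty \le 2^{-1}\lambda_1, \tag{26}$$

$$n^{-1/2}\|X\delta\|_2 \ge (\phi/2)\|\delta_S\|_2, \tag{27}$$

for any $\lambda_1 \ge c_0\sqrt{\log(p)/n}$ and $\delta \in \mathbb{R}^{\tilde{p}}$ satisfying $\|\delta_{S^c}\|_1 \le 4(s^{1/2} + \lambda_1^{-1}\lambda_2\|\theta_0\|_2)\|\delta_S\|_2$ when $n$ is sufficiently large. From now on, we condition on the event that inequalities (26) and (27) hold.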

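It follows from Condition 5 and Lemma 2 that

$$Cn^{-1}\|X\delta\|_2^2 + \lambda_1\|\hat{\theta}\|_1 + \lambda_2\|\hat{\theta}\|_2^2 \le \|n^{-1}\varepsilon^T X\|_\infty\|\delta\|_1 + \lambda_1\|\theta_0\|_1 + \lambda_2\|\theta_0\|_2^2,$$

where $C$ is some positive constant. Thus, by inequality (26), we have

$$Cn^{-1}\|X\delta\|_2^2 + \lambda_1\|\hat{\theta}\|_1 + \lambda_2\|\hat{\theta}\|_2^2 \le 2^{-1}\lambda_1\|\delta\|_1 + \lambda_1\|\theta_0\|_1 + \lambda_2\|\theta_0\|_2^2.$$

Recall that $\delta = \hat{\theta} - \theta_0$. Adding $2^{-1}\lambda_1\|\delta\|_1 - 2\lambda_2\theta_0^T(\hat{\theta} - \theta_0)$ to both sides of the above inequality and rearranging terms yield

$$Cn^{-1}\|X\delta\|_2^2 + 2^{-1}\lambda_1\|\delta\|_1 + \lambda_2\|\delta\|_2^2 \le \lambda_1(\|\theta_0\|_1 - \|\hat{\theta}\|_1 + \|\hat{\theta} - \theta_0\|_1) - 2\lambda_2\theta_0^T(\hat{\theta} - \theta_0). \tag{28}$$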
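The rearrangement leading to (28) rests on one quadratic identity, spelled out here as a short worked step. This is a sketch of the algebra under the notation $\delta = \hat{\theta} - \theta_0$ from the text; it introduces no new quantities.

```latex
% Substitute \hat\theta = \theta_0 + \delta and expand the square.
\begin{align*}
\|\hat\theta\|_2^2 - \|\theta_0\|_2^2 - 2\theta_0^T(\hat\theta - \theta_0)
  &= \|\theta_0 + \delta\|_2^2 - \|\theta_0\|_2^2 - 2\theta_0^T\delta \\
  &= \|\theta_0\|_2^2 + 2\theta_0^T\delta + \|\delta\|_2^2
     - \|\theta_0\|_2^2 - 2\theta_0^T\delta \\
  &= \|\delta\|_2^2 .
\end{align*}
% After adding 2^{-1}\lambda_1\|\delta\|_1 - 2\lambda_2\theta_0^T\delta to both sides,
% move \lambda_2\|\theta_0\|_2^2 to the left and \lambda_1\|\hat\theta\|_1 to the right:
% the \lambda_2 terms on the left collapse to \lambda_2\|\delta\|_2^2, and the
% \lambda_1 terms on the right collect into
% \lambda_1(\|\theta_0\|_1 - \|\hat\theta\|_1 + \|\delta\|_1).
```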

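Note that $\|\theta_0\|_1 - \|\hat{\theta}\|_1 + \|\hat{\theta} - \theta_0\|_1 = \|\theta_{0,S}\|_1 - \|\hat{\theta}_S\|_1 + \|\hat{\theta}_S - \theta_{0,S}\|_1$ since $|\theta_{0,j}| - |\hat{\theta}_j| + |\hat{\theta}_j - \theta_{0,j}| = 0$ for all $j \in S^c$. By the triangle inequality and the Cauchy-Schwarz inequality, we have

$$\|\theta_0\|_1 - \|\hat{\theta}\|_1 + \|\hat{\theta} - \theta_0\|_1 \le 2\|\hat{\theta}_S - \theta_{0,S}\|_1 \le 2s^{1/2}\|\delta_S\|_2. \tag{29}$$

Note that $|\theta_0^T(\hat{\theta} - \theta_0)| = |\theta_{0,S}^T(\hat{\theta}_S - \theta_{0,S})|$. An application of the Cauchy-Schwarz inequality gives

$$|\theta_0^T(\hat{\theta} - \theta_0)| \le \|\theta_{0,S}\|_2\|\hat{\theta}_S - \theta_{0,S}\|_2 = \|\theta_0\|_2\|\delta_S\|_2. \tag{30}$$

Combining these three results in (28)-(30) yields

$$Cn^{-1}\|X\delta\|_2^2 + 2^{-1}\lambda_1\|\delta\|_1 + \lambda_2\|\delta\|_2^2 \le 2(\lambda_1 s^{1/2} + \lambda_2\|\theta_0\|_2)\|\delta_S\|_2, \tag{31}$$

which, together with the fact that $\|\delta_{S^c}\|_1 \le \|\delta\|_1$, implies a basic constraint

$$\|\delta_{S^c}\|_1 \le 4(s^{1/2} + \lambda_1^{-1}\lambda_2\|\theta_0\|_2)\|\delta_S\|_2.$$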

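Thus, by inequality (27), we have $n^{-1/2}\|X\delta\|_2 \ge (\phi/2)\|\delta_S\|_2$. This, together with (31), gives

$$4^{-1}C\phi^2\|\delta_S\|_2^2 \le Cn^{-1}\|X\delta\|_2^2 \le 2(\lambda_1 s^{1/2} + \lambda_2\|\theta_0\|_2)\|\delta_S\|_2.$$

Solving this inequality yields $\|\delta_S\|_2 \le 8C^{-1}\phi^{-2}(\lambda_1 s^{1/2} + \lambda_2\|\theta_0\|_2)$. Combining this with (31) entails that

$$Cn^{-1}\|X\delta\|_2^2 + 2^{-1}\lambda_1\|\delta\|_1 + \lambda_2\|\delta\|_2^2 \le 16C^{-1}\phi^{-2}(\lambda_1 s^{1/2} + \lambda_2\|\theta_0\|_2)^2$$

holds with probability at least $1 - O(p^{-c_1})$. Thus from the above inequality we have

$$\|\hat{\theta} - \theta_0\|_1 = \|\delta\|_1 \le 32C^{-1}\phi^{-2}(\lambda_1 s^{1/2} + \lambda_2\|\theta_0\|_2)^2/\lambda_1,$$

$$n^{-1/2}\|X(\hat{\theta} - \theta_0)\|_2 = n^{-1/2}\|X\delta\|_2 \le 4C^{-1}\phi^{-1}(\lambda_1 s^{1/2} + \lambda_2\|\theta_0\|_2),$$

which hold simultaneously with probability at least $1 - O(p^{-c_1})$. This completes the proof of Theorem 2.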
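The chain of constants at the end of this proof (8, 16, 32 and 4) can be checked numerically. The following is a minimal sketch that retraces the three substitutions for user-supplied values; the function name, argument names and the numeric inputs are hypothetical and chosen only for illustration.

```python
import math


def theorem2_constant_chain(C, phi, lam1, lam2, s, theta0_l2):
    """Retrace the final steps of the proof of Theorem 2.

    Writes a = lam1 * s**0.5 + lam2 * ||theta_0||_2 and returns the bounds on
    ||delta_S||_2, on the left-hand side of (31), on ||delta||_1, and on
    n**-0.5 * ||X delta||_2, each derived from the previous one.
    """
    a = lam1 * math.sqrt(s) + lam2 * theta0_l2
    # (phi^2 / 4) * C * ||delta_S||_2^2 <= 2 * a * ||delta_S||_2
    delta_s_l2 = 8.0 * a / (C * phi ** 2)
    # Substitute into the right-hand side of (31): 2 * a * ||delta_S||_2
    lhs31 = 2.0 * a * delta_s_l2
    # Drop all but the l1 term: (lam1 / 2) * ||delta||_1 <= lhs31
    delta_l1 = 2.0 * lhs31 / lam1
    # Drop all but the prediction term: C * n^{-1} ||X delta||_2^2 <= lhs31
    pred_l2 = math.sqrt(lhs31 / C)
    return {"a": a, "delta_S_l2": delta_s_l2, "lhs31": lhs31,
            "delta_l1": delta_l1, "pred_l2": pred_l2}


if __name__ == "__main__":
    out = theorem2_constant_chain(C=2.0, phi=0.5, lam1=0.1, lam2=0.05,
                                  s=9, theta0_l2=2.0)
    for name, value in out.items():
        print(f"{name}: {value:.4f}")
```

The returned values agree with the closed forms $16C^{-1}\phi^{-2}a^2$, $32C^{-1}\phi^{-2}a^2/\lambda_1$ and $4C^{-1}\phi^{-1}a$, where $a = \lambda_1 s^{1/2} + \lambda_2\|\theta_0\|_2$.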
7.4. Proof of Theorem 3. Recall that $z = (Z_1, \ldots, Z_p)^T = \Delta z^{(1)} + (1 - \Delta)z^{(2)}$ and $x = (1, Z_1, \ldots, Z_p, Z_1^2, Z_1Z_2, \ldots, Z_{p-1}Z_p, Z_p^2)^T$. Define an event

$$\mathcal{E}_2 = \{\|\hat{\theta} - \theta_0\|_1 \le 32C^{-1}\phi^{-2}(\lambda_1 s^{1/2} + \lambda_2\|\theta_0\|_2)^2/\lambda_1\},$$

where positive constant $C$ is given in Theorem 2. From Theorem 2, we have $P(\mathcal{E}_2^c) \le O(p^{-c_1})$. By Lemma 5, under Condition 1, there exists a positive constant $C_2$ such that

$$P\Bigl\{\max_{1\le j\le p} |Z_j^{(k)}| > C_2\sqrt{\log(p)}\Bigr\} \le O(p^{-c_1}) \tag{32}$$

for $k = 1, 2$, where $(Z_1^{(k)}, \ldots, Z_p^{(k)})^T = z^{(k)}$. Define an event $\mathcal{E}_3 = \{\|z\|_\infty \le C_2\sqrt{\log(p)}\}$. Then $P(\mathcal{E}_3^c) \le O(p^{-c_1})$. An application of the Bonferroni inequality gives

$$P(\mathcal{E}_2^c \cup \mathcal{E}_3^c) \le O(p^{-c_1}) + O(p^{-c_1}) = O(p^{-c_1}). \tag{33}$$

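Denote by $\mathcal{C}_1$ the event $\{z \text{ from class } 1\}$. Note that on the event $\mathcal{E}_3$, we have $\|x\|_\infty \le C_2\log(p)$, where we use a generic constant $C_2$ to simplify notation. Using the property of conditional probability gives

$$\begin{aligned}
R_n(2|1) &= P(x^T\hat{\theta} \le 0 \mid \mathcal{C}_1) = P(x^T\theta_0 \le x^T(\theta_0 - \hat{\theta}) \mid \mathcal{C}_1) \\
&\le P(x^T\theta_0 \le x^T(\theta_0 - \hat{\theta}) \mid \mathcal{C}_1, \mathcal{E}_2 \cap \mathcal{E}_3) + P(\mathcal{E}_2^c \cup \mathcal{E}_3^c).
\end{aligned} \tag{34}$$

Note that conditioning on the event $\mathcal{E}_2 \cap \mathcal{E}_3$, $x^T(\theta_0 - \hat{\theta})$ can be bounded as

$$|x^T(\theta_0 - \hat{\theta})| \le \|x\|_\infty\|\hat{\theta} - \theta_0\|_1 \le 32C^{-1}C_2\phi^{-2}\log(p)(\lambda_1 s^{1/2} + \lambda_2\|\theta_0\|_2)^2/\lambda_1.$$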
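The bound above is Hölder's inequality applied to the augmented feature vector of main effects and all two-way interactions. The following is a small sketch that builds such a vector and checks the inequality on a toy input; the function names and the numeric values are hypothetical and used only for illustration.

```python
def quadratic_features(z):
    """Augmented vector (1, z_1..z_p, z_i * z_j for all i <= j)."""
    p = len(z)
    features = [1.0] + [float(v) for v in z]
    for i in range(p):
        for j in range(i, p):
            features.append(float(z[i] * z[j]))
    return features


def holder_sides(x, d):
    """Return (|x^T d|, ||x||_inf * ||d||_1) for equal-length vectors."""
    if len(x) != len(d):
        raise ValueError("x and d must have the same length")
    inner = abs(sum(xi * di for xi, di in zip(x, d)))
    bound = max(abs(xi) for xi in x) * sum(abs(di) for di in d)
    return inner, bound


if __name__ == "__main__":
    z = [2.0, -3.0]
    x = quadratic_features(z)          # length (p + 1)(p + 2) / 2 = 6
    d = [1.0, -1.0, 0.5, 0.0, 0.0, 0.0]  # plays the role of theta_hat - theta_0
    inner, bound = holder_sides(x, d)
    print(x, inner, bound)
```

Because every entry of the augmented vector is 1, a coordinate of $z$, or a product of two coordinates, its sup-norm is at most $\max(1, \|z\|_\infty^2)$, which is how a bound of order $\sqrt{\log(p)}$ on $\|z\|_\infty$ becomes a bound of order $\log(p)$ on $\|x\|_\infty$.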

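Then $|x^T(\theta_0 - \hat{\theta})| \le C_3\Lambda_n$ with positive constant $C_3 = 32C^{-1}C_2\phi^{-2}$. Thus we have

$$\begin{aligned}
P(x^T\theta_0 \le x^T(\theta_0 - \hat{\theta}) \mid \mathcal{C}_1, \mathcal{E}_2 \cap \mathcal{E}_3)
&\le P(x^T\theta_0 \le C_3\Lambda_n \mid \mathcal{C}_1, \mathcal{E}_2 \cap \mathcal{E}_3) \\
&= P(x^T\theta_0 \le C_3\Lambda_n \mid \mathcal{C}_1, \mathcal{E}_3) \\
&= \frac{P(x^T\theta_0 \le C_3\Lambda_n, \mathcal{E}_3 \mid \mathcal{C}_1)}{P(\mathcal{E}_3 \mid \mathcal{C}_1)} \\
&\le \frac{P(x^T\theta_0 \le C_3\Lambda_n \mid \mathcal{C}_1)}{P(\mathcal{E}_3 \mid \mathcal{C}_1)} = \frac{F_1(C_3\Lambda_n)}{P(\mathcal{E}_3 \mid \mathcal{C}_1)},
\end{aligned}$$

where $F_1(\cdot)$ is the cumulative distribution function of $x^T\theta_0 \mid \mathcal{C}_1$. This inequality, together with (34), entails

$$R_n(2|1) \le \frac{F_1(C_3\Lambda_n)}{P(\mathcal{E}_3 \mid \mathcal{C}_1)} + P(\mathcal{E}_2^c \cup \mathcal{E}_3^c).$$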
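By the definition of $F_1(\cdot)$, we have $R(2|1) = F_1(0)$. Thus

$$R_n(2|1) - R(2|1) \le \frac{F_1(C_3\Lambda_n) - F_1(0)}{P(\mathcal{E}_3 \mid \mathcal{C}_1)} + \left[\frac{1}{P(\mathcal{E}_3 \mid \mathcal{C}_1)} - 1\right]F_1(0) + P(\mathcal{E}_2^c \cup \mathcal{E}_3^c). \tag{35}$$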
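The step to (35) is a one-line rearrangement, written out here for completeness. The shorthand $\pi_3$ for the conditional probability of the event on which $\|z\|_\infty$ is bounded, given class 1, is introduced only for this sketch.

```latex
% Subtract F_1(0) from F_1(C_3\Lambda_n)/\pi_3 and split the result:
\[
  \frac{F_1(C_3\Lambda_n)}{\pi_3} - F_1(0)
  = \frac{F_1(C_3\Lambda_n) - F_1(0)}{\pi_3}
    + \left[\frac{1}{\pi_3} - 1\right] F_1(0).
\]
% The first term measures how much probability mass of x^T\theta_0 (given class 1)
% lies in (0, C_3\Lambda_n]; the second vanishes as \pi_3 \to 1, since 0 \le F_1(0) \le 1.
```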


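From Condition 7, we have $0 < \Lambda_n < \varepsilon_0$ when $n$ is sufficiently large and $C_3\Lambda_n = o(1)$. By the mean value theorem, it follows that $F_1(C_3\Lambda_n) - F_1(0) = C_3\Lambda_n F_1'(\Lambda^*) \le C_3 r_n\Lambda_n$, where $\Lambda^*$ is between $0$ and $C_3\Lambda_n$. In view of (32), we have

$$P(\mathcal{E}_3 \mid \mathcal{C}_1) = P\Bigl\{\max_{1\le j\le p} |Z_j^{(1)}| \le C_2\sqrt{\log(p)}\Bigr\} \ge 1 - O(p^{-c_1}).$$

Combining this with (33) and (35) entails

$$R_n(2|1) - R(2|1) \le \frac{C_3 r_n\Lambda_n}{1 - O(p^{-c_1})} + \frac{O(p^{-c_1})}{1 - O(p^{-c_1})} + O(p^{-c_1}) = O(r_n\Lambda_n) + O(p^{-c_1}).$$

Similarly, we can show that $R_n(1|2) \le R(1|2) + O(r_n\Lambda_n) + O(p^{-c_1})$. Combining these two results completes the proof of Theorem 3.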
SUPPLEMENTARY MATERIAL

Supplement to “Innovated interaction screening for high-dimensional nonlinear classification” (DOI: 10.1214/14-AOS1308SUPP; .pdf). We provide additional lemmas and technical proofs.